Abstract



Learning problems form an important category of computational tasks that generalizes many of the 
computations researchers apply to large real-life data sets. We ask: what concept classes can be learned 
privately, namely, by an algorithm whose output does not depend too heavily on any one input or specific 
training example? More precisely, we investigate learning algorithms that satisfy differential privacy, a 
notion that provides strong confidentiality guarantees in contexts where aggregate information is released 
about a database containing sensitive information about individuals. 

Our goal is a broad understanding of the resources required for private learning in terms of samples, 
computation time, and interaction. We demonstrate that, ignoring computational constraints, it is pos- 
sible to privately agnostically learn any concept class using a sample size approximately logarithmic in 
the cardinality of the concept class. Therefore, almost anything learnable is learnable privately: specifically,
if a concept class is learnable by a (non-private) algorithm with polynomial sample complexity
and output size, then it can be learned privately using a polynomial number of samples. We also present 
a computationally efficient private PAC learner for the class of parity functions. This result dispels the 
similarity between learning with noise and private learning (both must be robust to small changes in 
inputs), since parity is thought to be very hard to learn given random classification noise. 

Local (or randomized response) algorithms are a practical class of private algorithms that have re- 
ceived extensive investigation. We provide a precise characterization of local private learning algorithms. 
We show that a concept class is learnable by a local algorithm if and only if it is learnable in the statis- 
tical query (SQ) model. Therefore, for local private learning algorithms, the similarity to learning with 
noise is stronger: local learning is equivalent to SQ learning, and SQ algorithms include most known 
noise-tolerant learning algorithms. Finally, we present a separation between the power of interactive 
and noninteractive local learning algorithms. Because of the equivalence to SQ learning, this result also 
separates adaptive and nonadaptive SQ learning. 

1 Introduction 

The data privacy problem in modern databases is similar to that faced by statistical agencies and medical 
researchers: to learn and publish global analyses of a population while maintaining the confidentiality of the
participants in a survey. There is a vast body of work on this problem in statistics and computer science.
However, until recently, most schemes proposed in the literature lacked rigorous analysis of privacy and 
utility. 

A recent line of work ^HSHIIll31ISlEilEIElElEZl^lIHll23, initiated by Dinur and Nissim BSI 
and called private data analysis, seeks to place data privacy on firmer theoretical foundations and has been 
successful at formulating a strong, yet attainable privacy definition. The notion of differential privacy [24]
that emerged from this line of work provides rigorous guarantees even in the presence of a malicious adver- 
sary with access to arbitrary auxiliary information. It requires that whether an individual supplies her actual 
or fake information has almost no effect on the outcome of the analysis. 

Given this definition, it is natural to ask: what computational tasks can be performed while maintaining 
privacy? Research on data privacy, to the extent that it formalizes precise goals, has mostly focused on 
function evaluation ("what is the value of $f(z)$?"), namely, how much privacy is possible if one wishes to
release (an approximation to) a particular function $f$, evaluated on the database $z$. (A notable exception is the
recent work of McSherry and Talwar, using differential privacy in the design of auction mechanisms [44]). 
Our goal is to expand the utility of private protocols by examining which other computational tasks can be 
performed in a privacy-preserving manner.

Private Learning. Learning problems form an important category of computational tasks that generalizes 
many of the computations researchers apply to large real-life data sets. In this work, we ask what can be 
learned privately, namely, by an algorithm whose output does not depend too heavily on any one input or 
specific training example. Our goal is a broad understanding of the resources required for private learning 
in terms of samples, computation time, and interaction. We examine two basic notions from computational 
learning theory: Valiant's probabilistically approximately correct (PAC) learning model [51] and Kearns'
statistical query (SQ) model [39].

Informally, a concept is a function from examples to labels, and a class of concepts is learnable if for any 
distribution $\mathcal{D}$ on examples, one can, given limited access to examples sampled from $\mathcal{D}$ labeled according
to some target concept c, find a small circuit (hypothesis) which predicts c's labels with high probability 
over future examples taken from the same distribution. In the PAC model, a learning algorithm can access 
a polynomial number of labeled examples. In the SQ model, instead of accessing examples directly, the 
learner can specify some properties (i.e., predicates) on the examples, for which he is given an estimate, up 
to an additive polynomially small error, of the probability that a random example chosen from $\mathcal{D}$ satisfies
the property. PAC learning is strictly stronger than SQ learning [39].

We model a statistical database as a vector $z = (z_1, \ldots, z_n)$, where each entry has been contributed by
an individual. When analyzing how well a private algorithm learns a concept class, we assume that entries
$z_i$ of the database are random examples generated i.i.d. from the underlying distribution $\mathcal{D}$ and labeled by
a target concept $c$. This is exactly how (not necessarily private) learners are analyzed. For instance, an
example might consist of an individual's gender, age, and blood pressure history, and the label, whether this
individual has had a heart attack. The algorithm has to learn to predict whether an individual has had a heart
attack, based on gender, age, and blood pressure history, generated according to $\mathcal{D}$.

We require a private algorithm to keep entire examples (not only the labels) confidential. In the scenario 
above, it translates to not revealing each participant's gender, age, blood pressure history, and heart attack 
incidence. More precisely, the output of a private learner should not be significantly affected if a particular
example $z_i$ is replaced with an arbitrary $z_i'$, for all $z_i$ and $z_i'$. In contrast to correctness or utility, which
is analyzed with respect to the distribution $\mathcal{D}$, differential privacy is a worst-case notion. Hence, when we
analyze the privacy of our learners we do not make any assumptions on the underlying distribution. Such assumptions
are fragile and, in particular, would fall apart in the presence of auxiliary knowledge (also called
background knowledge or side information) that the adversary might have: conditioned on the adversary's
auxiliary knowledge, the distribution over examples might look very different from $\mathcal{D}$.

1.1 Our Contributions 

We introduce and formulate private learning problems, as discussed above, and develop novel algorithmic 
tools and bounds on the sample size required by private learning algorithms. Our results paint a picture of 
the classes of learning problems that are solvable subject to privacy constraints. Specifically, we provide: 

(1) A Private Version of Occam's Razor. We present a generic private learning algorithm. For any concept 
class C, we give a distribution-free differentially-private agnostic PAC learner for C that uses a number 
of samples proportional to $\log |C|$. This is a private analogue of the "cardinality version" of Occam's
razor, a basic sample complexity bound from (non-private) learning theory. The sample complexity 
of our version is similar to that of the original, although the private algorithm is very different. As in 
Occam's razor, the learning algorithm is not necessarily computationally efficient. 

(2) An Efficient Private Learner for Parity. We give a computationally efficient, distribution-free differentially
private PAC learner for the class of parity functions¹ over $\{0,1\}^d$. The sample and time
complexities are comparable to those of the best non-private learner.

(3) Equivalence of Local ("Randomized Response") and SQ Learning. We precisely characterize the 
power of local, or randomized response, private learning algorithms. Local algorithms are a special 
(practical) class of private algorithms and are popular in the data mining and statistics literature
[53, 2, 1, 3, 52, 29, 45, 36]. They add randomness to each individual's data independently before processing the
input. We show that a concept class is learnable by a local differentially private algorithm if and only if 
it is learnable in the statistical query (SQ) model. This equivalence relates notions that were conceived 
in very different contexts. 

(4) Separation of Interactive and Noninteractive Local Learning. Local algorithms can be noninterac- 
tive, that is, using one round of interaction with individuals holding the data, or interactive, that is, using 
more than one round (and in each receiving randomized responses from individuals). We construct a 
concept class, called masked-parity, that is efficiently learnable by interactive local algorithms under the 
uniform distribution on examples, but requires an exponential (in the dimension) number of samples to 
be learned by a noninteractive local algorithm. The equivalence (3) of local and SQ learning shows that
interaction in local algorithms corresponds to adaptivity in SQ algorithms. The masked-parity class thus 
also separates adaptive and nonadaptive SQ learning. 

1.1.1 Implications

"Anything" learnable is privately learnable using few samples. The generic agnostic learner (1) has an
important consequence: if some concept class C is learnable by any algorithm, not necessarily a private one, 
whose output length in bits is polynomially bounded, then C is learnable privately using a polynomial num- 
ber of samples (possibly in exponential time). This result establishes the basic feasibility of private learning: 
it was not clear a priori how severely privacy affects sample complexity, even ignoring computation time. 

¹While the generic learning result (1) extends easily to "agnostic" learning (defined below), the learner for parity does not. The
limitation is not surprising, since even non-private agnostic learning of parity is at least as hard as learning parity with random
noise.

Figure 1: Two basic models for database privacy: (a) the centralized model, in which data is collected by a trusted 
agency that publishes aggregate statistics or answers users' queries; (b) the local model, in which users retain their 
data and run a randomization procedure locally to produce output which is safe for publication. The dotted arrows 
from users to data holders indicate that protocols may be completely noninteractive: in this case there is a single 
publication, without feedback from users. 



Learning with noise is different from private learning. There is an intuitively appealing similarity be- 
tween learning from noisy examples and private learning: algorithms for both problems must be robust to 
small variations in the data. This apparent similarity is strengthened by a result of Blum, Dwork, McSherry
and Nissim [11] showing that any algorithm in Kearns' statistical query (SQ) model [39] can be implemented
in a differentially private manner. SQ was introduced to capture a class of noise-resistant learning
algorithms. These algorithms access their input only through a sequence of approximate averaging queries.
One can privately approximate the average of a function with values in [0, 1] over the data set of n individuals
to within additive error $O(1/n)$ (Dwork and Nissim [26]). Thus, one can simulate the behavior of an SQ
algorithm privately, query by query.

Our efficient private learner for parity (2) dispels the similarity between learning with noise and private
learning. First, SQ algorithms provably require exponentially many (in the dimension) queries to learn
parity [39]. More compellingly, learning parity with noise is thought to be computationally hard, and has
been used as the basis of several cryptographic primitives (e.g., [13, 35, 4, 49]).

Limitations of local ("randomized response") algorithms. Local algorithms (also referred to as randomized
response, input perturbation, Post Randomization Method (PRAM), and Framework for High-Accuracy
Strict-Privacy Preserving Mining (FRAPP)) have been studied extensively in the context of privacy-preserving
data mining, both in statistics and computer science (e.g., [53, 2, 1, 3, 52, 29, 45, 36]). Roughly,
a local algorithm accesses each individual's data via independent randomization operators. See Figure 1.

Local algorithms were introduced to encourage truthfulness in surveys: respondents who know that 
their data will be randomized are more likely to answer honestly. For example, Warner [53] famously 
considered a survey technique in which respondents are asked to give the correct answer to a sensitive 
(true/false) question with probability 2/3 and the incorrect answer with probability 1/3, in the hopes that 
the added uncertainty would encourage them to answer honestly. The proportion of "true" answers in the 
population is then estimated using a standard, non-private deconvolution. The accepted privacy requirement 
for local algorithms is equivalent to imposing differential privacy on each randomization operator [29]. 
Local algorithms are popular because they are easy to understand and implement. In the extreme case, users 
can retain their data and apply the randomization operator themselves, using a physical device [53, 46] or a
cryptographic protocol 0. 
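
Warner's scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the cited literature: the function names, the synthetic population, and the fixed seed are hypothetical. Each report equals the true bit with probability 2/3, so the probabilities of any given report under the two possible inputs differ by a factor of at most (2/3)/(1/3) = 2; that is, each randomization operator is (ln 2)-differentially private. If the true proportion is p, the expected fraction of "true" reports is 1/3 + p/3, and the estimator simply inverts this relation.

```python
import random

def randomize(bit, rng, p_truth=2/3):
    """Report the true bit with probability p_truth, otherwise the flipped bit."""
    return bit if rng.random() < p_truth else 1 - bit

def estimate_proportion(reports, p_truth=2/3):
    """Invert E[report] = (1 - p_truth) + (2 * p_truth - 1) * p to estimate p."""
    q = sum(reports) / len(reports)
    return (q - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(1)                      # fixed seed, for reproducibility only
true_bits = [1] * 3000 + [0] * 7000         # hypothetical population: 30% "true"
reports = [randomize(b, rng) for b in true_bits]
estimate = estimate_proportion(reports)     # should be close to 0.3
```

The standard error of the estimate shrinks roughly as the inverse square root of the number of respondents, which is the statistical price paid for randomizing every answer.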

The equivalence between local and SQ algorithms (3) is a powerful tool that allows us to apply results
from learning theory. In particular, since parity is not learnable with a small number of SQ queries [39]
but is PAC learnable privately (2), we get that local algorithms require exponentially more data for some
learning tasks than do general private algorithms. Our results also imply that local algorithms are strictly
less powerful than (non-private) algorithms for learning with classification noise because subexponential
(non-private) algorithms can learn parity with noise [13].

Adaptivity in SQ algorithms is important. Just as local algorithms can be interactive, SQ algorithms 
can be adaptive, that is, the averaging queries they make may depend on answers to previous queries. The 
equivalence of SQ and local algorithms (3) preserves interaction/adaptivity: a concept class is nonadaptively
SQ learnable if and only if it is noninteractively locally learnable. The masked-parity class (4) shows that
interaction (resp., adaptivity) adds considerable power to local (resp., SQ) algorithms. 

Most of the reasons that local algorithms are so attractive in practice, and have received such attention, 
apply only to noninteractive algorithms (interaction can be costly, complicated, or even impossible — for 
instance, when statistical information is collected by an interviewer, or at a polling booth). 

This suggests that further investigating the power of nonadaptive SQ learners is an important problem. 
For example, the SQ algorithm for learning conjunctions [42] is nonadaptive, but SQ formulations of the
perceptron and k-means algorithms [11] seem to rely heavily on adaptivity.

Understanding the "price" of privacy for learning problems. The SQ result of Blum et al. [11] and our
learner for parity (2) provide efficient (i.e., polynomial time) private learners for essentially all the concept
classes known (by us) to have efficient non-private distribution-free learners. Finding a concept class that 
can be learned efficiently, but not privately and efficiently, remains an interesting and important question. 

Our results also lead to questions of optimal sample complexity for learning problems of practical importance.
The private simulation of SQ algorithms due to Blum et al. [11] uses a factor of approximately
$\sqrt{t}/\epsilon$ more data points than the naive non-private implementation, where $t$ is the number of SQ queries and
$\epsilon$ is the parameter of differential privacy (typically a small constant). In contrast, the generic agnostic learner
(1) uses a factor of at most $1/\epsilon$ more samples than the corresponding non-private learner. For parity, our
private learner uses a factor of roughly $1/\epsilon$ more samples than, and about the same computation time as,
the non-private learner. What, then, is the additional cost of privacy when learning practical concept classes
(half-planes, low-dimensional curves, etc.)? Can the theoretical sample bounds of (1) be matched by (more)
efficient learners?

1.1.2 Techniques 

Our generic private learner (1) adapts the exponential sampling technique of McSherry and Talwar [44],
developed in the context of auction design. Our use of the exponential mechanism inspired an elegant
subsequent result of Blum, Ligett, and Roth [14] (BLR) on simultaneously approximating many different
functions.
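
As a rough illustration of the exponential sampling idea, the Python sketch below selects a hypothesis from a small finite class with probability proportional to exp(-ε · errors / 2), where errors is the number of training examples the hypothesis misclassifies. The scoring rule, the factor 1/2, the toy hypothesis class, and every name in the code are assumptions made for this sketch; it should not be read as the exact algorithm or analysis of Section 3. Because changing one example changes each error count by at most 1, this weighting is the standard exponential-mechanism form for an ε-differentially private selection.

```python
import math
import random

def private_select(hypotheses, examples, epsilon, rng):
    """Return the index of a hypothesis sampled with probability
    proportional to exp(-epsilon * errors / 2)."""
    errors = [sum(1 for x, y in examples if h(x) != y) for h in hypotheses]
    best = min(errors)
    # Subtracting the minimum rescales all weights equally (same distribution)
    # and avoids floating-point underflow for large error counts.
    weights = [math.exp(-epsilon * (e - best) / 2) for e in errors]
    return rng.choices(range(len(hypotheses)), weights=weights, k=1)[0]

# Hypothetical class: the three single-bit functions on {0,1}^3; the target is bit 0.
hypotheses = [lambda x, i=i: x[i] for i in range(3)]
rng = random.Random(2)
examples = []
for _ in range(200):
    x = tuple(rng.randint(0, 1) for _ in range(3))
    examples.append((x, x[0]))

chosen = private_select(hypotheses, examples, epsilon=1.0, rng=rng)
```

With 200 examples the two wrong hypotheses each err on roughly half the data, so their weights are exponentially small and the sampler returns the target with overwhelming probability. The running time is linear in the size of the class, which is why such a learner need not be computationally efficient.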

The efficient private learner for parity (2) uses a very different technique, based on sampling, running a
non-private learner, and occasionally refusing to answer based on delicately calibrated probabilities. Run- 
ning a non-private learner on a random subset of examples is a very intuitive approach to building private 
algorithms, but it is not private in general. The private learner for parity illustrates both why this technique 
can leak private information and how it can sometimes be repaired based on special (in this case, algebraic) 
structure.

The interesting direction of the equivalence between SQ and local learners (3) is proved via a simulation
of any local algorithm by a corresponding SQ algorithm. We found this simulation surprising since local
protocols can, in general, have very complex structure (see, e.g., [29]). The SQ algorithm proceeds by a
direct simulation of the output of the randomization operators. For a given input distribution $\mathcal{D}$ and any
operator $R$, one can sample from the corresponding output distribution $R(\mathcal{D})$ via rejection sampling. We
show that if $R$ is differentially private, the rejection probabilities can be approximated via low-accuracy SQ
queries to $\mathcal{D}$.

Finally, the separation between adaptive and nonadaptive SQ (4) uses a Fourier analytic argument inspired
by Kearns' SQ lower bound for parity [39].

1.1.3 Classes of Private Learning Algorithms 



Figure 2: Relationships among learning classes taking into account sample complexity, but not computational efficiency. (Labels in the figure: PARITY, MASKED-PARITY.)




We can summarize our results via a complexity-theoretic picture of learnable and privately learnable 
concept classes (more precisely, the members of the classes are pairs of concept classes and example dis- 
tributions). In order to make asymptotic statements, we measure complexity in terms of the length d of the 
binary description of examples. 

We first consider learners that use a polynomial (in d) number of samples and output a hypothesis that is 
described using a polynomial number of bits, but have unlimited computation time. Let PAC* denote the set 
of concept classes that are learnable by such algorithms ignoring privacy, and let PPAC* denote the subset 
of PAC* learnable by differentially private² algorithms.

Since we restrict the learner's output to a polynomial number of bits, the hypothesis classes of the
algorithms are de facto limited to have size at most $\exp(\mathrm{poly}(d))$. Thus, the generic private learner (point
(1) in the introduction) will use a polynomial number of samples, and PAC* = PPAC*.

We can similarly interpret the other results above. Within PAC* , we can consider subsets of concepts 
learnable by SQ algorithms (SQ*), nonadaptive SQ algorithms (NASQ*), local interactive algorithms (LI*) 
and local noninteractive algorithms (LNI*). We obtain the following picture (see Figure 2):

LNI* = NASQ* ⊊ LI* = SQ* ⊊ PPAC* = PAC*.

The equality of LI* and SQ*, and of LNI* and NASQ*, follows from the SQ simulation of local algorithms
(Theorem 5.14). The parity and masked-parity concept classes separate PPAC* from SQ* and SQ* from
NASQ*, respectively (Corollaries 5.15 and 5.17). (Note: The separation of PPAC* from SQ* holds even
for distribution-free learning; in contrast, the separation of SQ* from NASQ* holds for learnability under a
specific distribution on examples, since the adaptive SQ learner for MASKED-PARITY requires a uniform
distribution on examples.)

²Differential privacy is quantified by a real parameter $\epsilon > 0$. To make qualitative statements, we look at algorithms where
$\epsilon \to 0$ as $d \to \infty$. Taking $\epsilon = 1/d^c$ for any constant $c > 0$ would yield the same class.

When we take computational efficiency into account, the picture changes. The relation between local
and SQ classes remains the same modulo a technical restriction on the randomization operators (Definition 5.13).
SQ remains distinct from PPAC since parity is efficiently learnable privately. However, it is
an open question whether concept classes which can be efficiently learned can also be efficiently learned
privately.

1.2 Related Work 

Prior to this work, the literature on differential privacy studied function approximation tasks (e.g. 
l26l [Til |24l l47l 13), with the exception of the work of McSherry and Talwar on mechanism design ll44l . 
Nevertheless, several of these prior results have direct implications to machine learning-related problems. 
Blum et al. [11] considered a particular class of learning algorithms (SQ), and showed that algorithms in the
class could be simulated using noisy function evaluations. In an independent, unpublished work, Chaudhuri, 
Dwork, and Talwar considered a version of private learning in which privacy is afforded only to input 
labels, but not to examples. Other works considered specific machine learning problems such as mining 
frequent itemsets [29], k-means clustering [11, 47], learning decision trees [11], and learning mixtures of
Gaussians [47].

As mentioned above, a subsequent result of Blum, Ligett and Roth [14] on approximating classes of
low-VC-dimension functions was inspired by our generic agnostic learner. We discuss their result further
in Section 3.1. Since the original version of our work, there have also been several results connecting
differential privacy to more "statistical" notions of utility, such as consistency of point estimation and density
estimation [50, 23, 54, 56].

Our separation of interactive and noninteractive protocols in the local model (4) also has a precedent:
Dwork et al. [24] separated interactive and noninteractive private protocols in the centralized model, where
the user accesses the data via a server that runs differentially private algorithms on the database and sends 
back the answers. That separation has a very different flavor from the one in this work: any example of a 
computation that cannot be performed noninteractively in the centralized model must rely on the fact that the 
computational task is not defined until after the first answer from the server is received. (Otherwise, the user 
can send an algorithm for that task to the server holding the data, thus obviating the need for interaction.) In 
contrast, we present a computational task that is hard for noninteractive local algorithms - learning masked 
parity - yet is defined in advance. 

In the machine learning literature, several notions similar to differential privacy have been explored 
under the rubric of "algorithmic stability" [19, 40, 16, 43, 28, 9]. The most closely related notion is change-one
error stability, which measures how much the generalization error changes when an input is changed
(see the survey [43]). In contrast, differential privacy measures how the distribution over the entire output
changes, a more complex measure of stability (in particular, differential privacy implies change-one error
stability). A different notion, stability under resampling of the data from a given distribution [10, 9], is connected
to the sample-and-aggregate method of [47] but is not directly relevant to the techniques considered
here. Finally, in a different vein, Freund, Mansour and Schapire OTTl used a weighted averaging technique 
with the same weights as the sampler in our generic learner to reduce generalization error (see Section 3.1).

2 Preliminaries



We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. Logarithms base 2 and base $e$ are denoted by log and ln, respectively.
$\Pr[\cdot]$ and $\mathbb{E}[\cdot]$ denote probability and expectation, respectively. $\mathcal{A}(x)$ is the probability distribution
over outputs of a randomized algorithm $\mathcal{A}$ on input $x$. The statistical difference between distributions $P$ and
$Q$ on a discrete space $D$ is defined as $\max_{S \subset D} |P(S) - Q(S)|$.

2.1 Differential Privacy 

A statistical database is a vector $z = (z_1, \ldots, z_n)$ over a domain $D$, where each entry $z_i \in D$ represents
information contributed by one individual. Databases $z$ and $z'$ are neighbors if $z_i \neq z_i'$ for exactly one
$i \in [n]$ (i.e., the Hamming distance between $z$ and $z'$ is 1). All our algorithms are symmetric, that is, they do
not depend on the order of entries in the database z. Thus, we could define a database as a multi-set in D, 
and use symmetric difference instead of the Hamming metric to measure distance. We adhere to the vector 
formulation for consistency with the previous works. 

A (randomized) algorithm (in our context, this will usually be a learning algorithm) is private if neigh- 
boring databases induce nearby distributions on its outcomes: 

Definition 2.1 ($\epsilon$-differential privacy [24]). A randomized algorithm $\mathcal{A}$ is $\epsilon$-differentially private if for all
neighboring databases $z, z'$, and for all sets $S$ of outputs,

$$\Pr[\mathcal{A}(z) \in S] \le \exp(\epsilon) \cdot \Pr[\mathcal{A}(z') \in S].$$

The probability is taken over the random coins of $\mathcal{A}$.

In [24], the notion above was called "indistinguishability". The name "differential privacy" was suggested
by Mike Schroeder, and first appeared in Dwork [21].

Differential privacy composes well (see, e.g., [22, 47, 44, 38]):

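Claim 2.2 (Composition and Post-processing). If a randomized algorithm $\mathcal{A}$ runs $k$ algorithms $\mathcal{A}_1, \ldots, \mathcal{A}_k$,
where each $\mathcal{A}_i$ is $\epsilon_i$-differentially private, and outputs a function of the results (that is,
$\mathcal{A}(z) = g(\mathcal{A}_1(z), \mathcal{A}_2(z), \ldots, \mathcal{A}_k(z))$ for some probabilistic algorithm $g$), then $\mathcal{A}$ is $\left(\sum_{i=1}^{k} \epsilon_i\right)$-differentially private.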

One method for obtaining efficient differentially private algorithms for approximating real-valued functions
is based on adding Laplacian noise to the true answer. Let $\mathrm{Lap}(\lambda)$ denote the Laplace probability
distribution with mean 0, standard deviation $\sqrt{2}\lambda$, and p.d.f. $f(x) = \frac{1}{2\lambda} e^{-|x|/\lambda}$.

Theorem 2.3 (Dwork et al. [24]). For a function $f : D^n \to \mathbb{R}$, define its global sensitivity $GS_f = \max_{z,z'} |f(z) - f(z')|$,
where the maximum is over all neighboring databases $z, z'$. Then, an algorithm
that on input $z$ returns $f(z) + \eta$, where $\eta \sim \mathrm{Lap}(GS_f/\epsilon)$, is $\epsilon$-differentially private.
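
To see the theorem in action, the following Python sketch releases the mean of $n$ values in $[0, 1]$. Replacing one entry changes the mean by at most $1/n$, so the global sensitivity is $1/n$ and the added noise has scale $1/(n\epsilon)$. The helper names, the synthetic data, and the choice of sampling Laplace noise as a difference of two exponential variables are assumptions of this sketch; it also assumes the caller has already ensured that every value lies in $[0, 1]$.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Lap(scale): the difference of two i.i.d. exponential
    variables, each with mean `scale`, has the Laplace distribution."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_mean(values, epsilon, rng):
    """Release the mean of values in [0, 1] with Laplace noise.

    Changing one entry moves the mean by at most 1/n, so the
    global sensitivity is 1/n and the noise scale is (1/n) / epsilon.
    """
    n = len(values)
    return sum(values) / n + laplace_noise((1 / n) / epsilon, rng)

rng = random.Random(3)                     # fixed seed, for reproducibility only
values = [1.0] * 400 + [0.0] * 600         # hypothetical data set with mean 0.4
released = private_mean(values, epsilon=0.5, rng=rng)
```

Here the noise scale is 1/(1000 · 0.5) = 0.002, so the released value is typically within a few thousandths of the true mean. This is the kind of private averaging that allows an SQ algorithm to be simulated query by query.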

2.2 Preliminaries from Learning Theory 

A concept is a function that labels examples taken from the domain X by the elements of the range Y. 
A concept class C is a set of concepts. It comes implicitly with a way to represent concepts; size(c) is 
the size of the (smallest) representation of c under the given representation scheme. The domain and the 
range of the concepts in C are understood to be ensembles $X = \{X_d\}_{d \in \mathbb{N}}$ and $Y = \{Y_d\}_{d \in \mathbb{N}}$, where the
representation of elements in $X_d$, $Y_d$ is of size at most $d$. We focus on binary classification problems, in
which the label space $Y_d$ is $\{0, 1\}$ or $\{+1, -1\}$; the parameter $d$ thus measures the size of the examples
in $X_d$. (We use the parameter $d$ to formulate asymptotic complexity notions.) The concept classes are
ensembles $C = \{C_d\}_{d \in \mathbb{N}}$ where $C_d$ is the class of concepts from $X_d$ to $Y_d$. When the size parameter is clear
from the context or not important, we omit the subscript in $X_d$, $Y_d$, $C_d$.

Let $\mathcal{D}$ be a distribution over labeled examples in $X_d \times Y_d$. A learning algorithm is given access to $\mathcal{D}$ (the
method for accessing $\mathcal{D}$ depends on the type of learning algorithm). It outputs a hypothesis $h : X_d \to Y_d$
from a hypothesis class $\mathcal{H} = \{\mathcal{H}_d\}_{d \in \mathbb{N}}$. The goal is to minimize the misclassification error of $h$ on $\mathcal{D}$,
defined as

$$err(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y].$$

The success of a learning algorithm is quantified by parameters $\alpha$ and $\beta$, where $\alpha$ is the desired error
and $\beta$ bounds the probability of failure to output a hypothesis with this error. Error measures other than
misclassification are considered in supervised learning (e.g., iJf). We study only misclassification error 
here, since for binary labels it is equivalent to the other common error measures. 

A learning algorithm is usually given access to an oracle that produces i.i.d. samples from $\mathcal{D}$. Equivalently,
one can view the learning algorithm's input as a list of $n$ labeled examples, i.e., $z \in D^n$ where
$D = X_d \times Y_d$. PAC learning and agnostic learning are described in Definitions 2.4 and 2.5. Another
common method of access to $\mathcal{D}$ is via "statistical queries", which return the approximate average of a function
over the distribution. Algorithms that work in this model can be simulated given i.i.d. examples. See
Section 5.
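
The simulation of statistical queries from i.i.d. examples can be illustrated with a short Python sketch: a query is a predicate on labeled examples, and the simulated oracle answers with its empirical frequency, which is close to the true probability once the sample is large enough. The oracle interface, the toy target concept, and all names below are hypothetical choices for this illustration, not the formal definitions of Section 5.

```python
import random

def sq_oracle(predicate, examples):
    """Answer a statistical query with the empirical frequency of
    predicate(x, y) over the labeled examples."""
    return sum(1 for x, y in examples if predicate(x, y)) / len(examples)

def draw_examples(n, rng):
    """Draw n examples x uniformly from {0,1}^3, labeled by the target c(x) = x[0]."""
    out = []
    for _ in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(3))
        out.append((x, x[0]))
    return out

rng = random.Random(0)                     # fixed seed, for reproducibility only
examples = draw_examples(20000, rng)

# Query: how often does bit 0 agree with the label? (always, for this target)
agree0 = sq_oracle(lambda x, y: x[0] == y, examples)
# Query: how often does bit 1 agree with the label? (about half the time)
agree1 = sq_oracle(lambda x, y: x[1] == y, examples)
```

A learner that sees only such averages can still identify the target here: the query for bit 0 returns 1, while the queries for the other bits return values near 1/2.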

PAC learning algorithms are frequently designed assuming a promise that the examples are labeled
consistently with some target concept $c$ from a class $C$: namely, $c \in C_d$ and $y = c(x)$ for all $(x, y)$ in the
support of $\mathcal{D}$. In that case, we can think of $\mathcal{D}$ as a distribution only over examples $X_d$. To avoid ambiguity,
we use $\mathcal{X}$ to denote a distribution over $X_d$. In the PAC setting, $err(h) = \Pr_{x \sim \mathcal{X}}[h(x) \neq c(x)]$.

Definition 2.4 (PAC Learning). A concept class $C$ over $X$ is PAC learnable using hypothesis class $\mathcal{H}$ if
there exist an algorithm $\mathcal{A}$ and a polynomial $\mathrm{poly}(\cdot, \cdot, \cdot)$ such that for all $d \in \mathbb{N}$, all concepts $c \in C_d$,
all distributions $\mathcal{X}$ on $X_d$, and all $\alpha, \beta \in (0, 1/2)$, given inputs $\alpha, \beta$ and $z = (z_1, \ldots, z_n)$, where
$n = \mathrm{poly}(d, 1/\alpha, \log(1/\beta))$, $z_i = (x_i, c(x_i))$ and $x_i$ are drawn i.i.d. from $\mathcal{X}$ for $i \in [n]$, algorithm $\mathcal{A}$ outputs a
hypothesis $h \in \mathcal{H}$ satisfying

$$\Pr[err(h) \le \alpha] \ge 1 - \beta. \tag{1}$$

The probability is taken over the random choice of the examples $z$ and the coin tosses of $\mathcal{A}$.

Class $C$ is (inefficiently) PAC learnable if there exists some hypothesis class $\mathcal{H}$ and a PAC learner $\mathcal{A}$ such
that $\mathcal{A}$ PAC learns $C$ using $\mathcal{H}$. Class $C$ is efficiently PAC learnable if $\mathcal{A}$ runs in time polynomial in $d$, $1/\alpha$,
and $\log(1/\beta)$.

Remark: Our definition deviates slightly from the standard one (see, e.g., [42]) in that we do not take into
consideration the size of the concept $c$. This choice allows us to treat PAC learners and agnostic learners
identically. One can change Definition 2.4 so that the number of samples depends polynomially also on the
size of $c$ without affecting any of our results significantly.

Agnostic learning [32, 41] is an extension of PAC learning that removes assumptions about the target
concept. Roughly speaking, the goal of an agnostic learner for a concept class $C$ is to output a hypothesis
$h \in \mathcal{H}$ whose error with respect to the distribution is close to the optimal possible by a function from $C$. In
the agnostic setting, $err(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$.

Definition 2.5 (Agnostic Learning). (Efficiently) agnostically learnable is defined identically to (efficiently)
PAC learnable with two exceptions: (i) the data are drawn from an arbitrary distribution $\mathcal{D}$ on $X_d \times Y_d$;
(ii) instead of Equation (1), the output of $\mathcal{A}$ has to satisfy:

$$\Pr[err(h) \le OPT + \alpha] \ge 1 - \beta,$$

where $OPT = \min_{f \in C_d} \{err(f)\}$. As before, the probability is taken over the random choice of $z$, and the
coin tosses of $\mathcal{A}$.



Definitions 2.4 and 2.5 capture distribution-free learning, in that they do not assume a particular form
for the distributions $\mathcal{X}$ or $\mathcal{D}$. In Section 5.3 we also consider learning algorithms that assume a specific
distribution $\mathcal{D}$ on examples (but make no assumption on which concept in $\mathcal{C}$ labels the examples). When we
discuss such algorithms, we specify $\mathcal{D}$ explicitly; without qualification, "learning" refers to distribution-free
learning.



Efficiency Measures. The definitions above are sufficiently detailed to allow for exact complexity
statements (e.g., "$\mathcal{A}$ learns $\mathcal{C}$ using $n(\alpha, \beta)$ examples and time $O(t)$"), and the upper and lower bounds in this
paper are all stated in this language. However, we also focus on two broader measures to allow for qualitative
statements: (a) Polynomial sample complexity is the default notion in our definitions. With the novel
restriction of privacy, it is not a priori clear which concept classes can be learned using few examples even if we
ignore computation time. (b) We use the term efficient private learning to impose the additional restriction
of polynomial computation time (which implies polynomial sample complexity).



3 Private PAC and Agnostic Learning 

We define private PAC learners as algorithms that satisfy definitions of both differential privacy and PAC 
learning. We emphasize that these are qualitatively different requirements. Learning must succeed on 
average over a set of examples drawn i.i.d. from $\mathcal{D}$ (often under the additional promise that $\mathcal{D}$ is consistent
with a concept from a target class). Differential privacy, in contrast, must hold in the worst case, with no
assumptions on consistency. 



Definition 3.1 (Private PAC Learning). Let $d, \alpha, \beta$ be as in Definition 2.4 and $\epsilon > 0$. Concept class $\mathcal{C}$ is
(inefficiently) privately PAC learnable using hypothesis class $\mathcal{H}$ if there exists an algorithm $\mathcal{A}$ that takes
inputs $\epsilon, \alpha, \beta, z$, where $n$, the number of labeled examples in $z$, is polynomial in $1/\epsilon$, $d$, $1/\alpha$, $\log(1/\beta)$, and
satisfies

a. [Privacy] For all $\epsilon > 0$, algorithm $\mathcal{A}(\epsilon, \cdot, \cdot)$ is $\epsilon$-differentially private (Definition 2.1);

b. [Utility] Algorithm $\mathcal{A}$ PAC learns $\mathcal{C}$ using $\mathcal{H}$ (Definition 2.4).

$\mathcal{C}$ is efficiently privately PAC learnable if $\mathcal{A}$ runs in time polynomial in $d$, $1/\epsilon$, $1/\alpha$, and $\log(1/\beta)$.

Definition 3.2 (Private Agnostic Learning). (Efficient) private agnostic learning is defined analogously to
(efficient) private PAC learning with Definition 2.5 replacing Definition 2.4 in the utility condition.



Evaluating the quality of a particular hypothesis is easy: one can privately compute the fraction of the 
data it classifies correctly (enabling cross-validation) using the sum query framework of [11]. The difficulty
of constructing private learners lies in finding a good hypothesis in what is typically an exponentially large 
space.

3.1 A Generic Private Agnostic Learner



In this section, we present a private analogue of a basic consistent learning result, often called the cardinality
version of Occam's razor.³ This classical result shows that a PAC learner can weed out all bad hypotheses
given a number of labeled examples that is logarithmic in the size of the hypothesis class (see [42, p. 35]).
Our generic private learner is based on the exponential mechanism of McSherry and Talwar [44].

Let $q : D^n \times \mathcal{H}_d \to \mathbb{R}$ take a database $z$ and a candidate hypothesis $h$, and assign it a score $q(z, h) =
-|\{i : x_i \text{ is misclassified by } h, \text{ i.e., } y_i \neq h(x_i)\}|$. That is, the score is minus the number of points in $z$
misclassified by $h$. The classic Occam's razor argument assumes a learner that selects a hypothesis with
maximum score (that is, minimum empirical error). Instead, our private learner $\mathcal{A}^\epsilon_q$ is defined to sample a
random hypothesis with probability dependent on its score:

$\mathcal{A}^\epsilon_q(z)$: Output hypothesis $h \in \mathcal{H}_d$ with probability proportional to $\exp\left(\frac{\epsilon \cdot q(z, h)}{2}\right)$.

Since the score ranges from $-n$ to 0, hypotheses with low empirical error are exponentially more likely to
be selected than ones with high error.
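
As an illustration, the sampling rule above can be sketched in a few lines of Python. This is a minimal sketch for a small, explicitly enumerated hypothesis class; the function names, the toy threshold class, and the `rng` parameter are assumptions made for the example and do not come from the paper.

```python
import math
import random

def score(z, h):
    """q(z, h): minus the number of examples (x, y) in z that h misclassifies."""
    return -sum(1 for x, y in z if h(x) != y)

def private_generic_learner(z, hypotheses, eps, rng=random):
    """Return an index into `hypotheses`, sampled with probability
    proportional to exp(eps * q(z, h) / 2)."""
    scores = [score(z, h) for h in hypotheses]
    best = max(scores)  # shifting all scores leaves the sampling ratios unchanged
    weights = [math.exp(eps * (s - best) / 2) for s in scores]
    return rng.choices(range(len(hypotheses)), weights=weights, k=1)[0]

# Toy usage (hypothetical): threshold functions on {0, ..., 9},
# with labels produced by the threshold at 5.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(11)]
z = [(x, int(x >= 5)) for x in range(10)] * 20
picked = private_generic_learner(z, hypotheses, eps=1.0, rng=random.Random(0))
```

The running time is linear in the number of hypotheses, which is why this approach is not efficient when the class is exponentially large.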

Algorithm $\mathcal{A}^\epsilon_q$ fits the framework of McSherry and Talwar, and so is $\epsilon$-differentially private. This follows
from the fact that changing one entry $z_i$ in the database $z$ can change the score by at most 1.

Lemma 3.3 (following [44]). The algorithm $\mathcal{A}^\epsilon_q$ is $\epsilon$-differentially private.

A similar exponential weighting algorithm was considered by Freund, Mansour and Schapire lf3Tl for 
constructing binary classifiers with good generalization error bounds. We are not aware of any direct
connection between the two results. Also note that, except for the case where $|\mathcal{H}_d|$ is polynomial, the exponential
mechanism $\mathcal{A}^\epsilon_q(z)$ does not necessarily yield a polynomial time algorithm.

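Theorem 3.4 (Generic Private Learner). For all $d \in \mathbb{N}$, any concept class $\mathcal{C}_d$ whose cardinality is at most
$\exp(poly(d))$ is privately agnostically learnable using $\mathcal{H}_d = \mathcal{C}_d$. More precisely, the learner uses $n =
O\left(\left(\ln|\mathcal{H}_d| + \ln\frac{1}{\beta}\right) \cdot \max\left\{\frac{1}{\epsilon\alpha}, \frac{1}{\alpha^2}\right\}\right)$ labeled examples from $\mathcal{D}$, where $\epsilon$, $\alpha$, and $\beta$ are parameters of the
private learner. (The learner might not be efficient.)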



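Proof. Let $\mathcal{A}^\epsilon_q$ be as defined above. The privacy condition in Definition 3.1 is satisfied by Lemma 3.3.

We now show that the utility condition is also satisfied. Consider the event $E = \{\mathcal{A}^\epsilon_q(z) = h \text{ with } err(h) >
\alpha + OPT\}$. We want to prove that $\Pr[E] \le \beta$. Define the training error of $h$ as

$err_T(h) = |\{i \in [n] \mid h(x_i) \neq y_i\}|/n = -q(z, h)/n.$

By Chernoff-Hoeffding bounds (see Theorem A.2 in Appendix A),

$\Pr[|err(h) - err_T(h)| \ge \rho] \le 2\exp(-2n\rho^2)$

for all hypotheses $h \in \mathcal{H}_d$. Hence,

$\Pr[|err(h) - err_T(h)| \ge \rho \text{ for some } h \in \mathcal{H}_d] \le 2|\mathcal{H}_d| \exp(-2n\rho^2).$

We now analyze $\mathcal{A}^\epsilon_q(z)$ conditioned on the event that for all $h \in \mathcal{H}_d$, $|err(h) - err_T(h)| < \rho$. For every
$h \in \mathcal{H}_d$, the probability that $\mathcal{A}^\epsilon_q(z) = h$ is

$\frac{\exp(-\frac{\epsilon}{2} \cdot n \cdot err_T(h))}{\sum_{h' \in \mathcal{H}_d} \exp(-\frac{\epsilon}{2} \cdot n \cdot err_T(h'))} \le \frac{\exp(-\frac{\epsilon}{2} \cdot n \cdot err_T(h))}{\max_{h' \in \mathcal{H}_d} \exp(-\frac{\epsilon}{2} \cdot n \cdot err_T(h'))}$

$= \exp\left(-\frac{\epsilon}{2} \cdot n \cdot \left(err_T(h) - \min_{h' \in \mathcal{H}_d} err_T(h')\right)\right)$

$\le \exp\left(-\frac{\epsilon}{2} \cdot n \cdot \left(err_T(h) - (OPT + \rho)\right)\right).$

Hence, the probability that $\mathcal{A}^\epsilon_q(z)$ outputs a hypothesis $h \in \mathcal{H}_d$ such that $err_T(h) \ge OPT + 2\rho$ is at most
$|\mathcal{H}_d| \exp(-\epsilon n \rho/2)$.

Now set $\rho = \alpha/3$. If $err(h) \ge OPT + \alpha$ then $|err(h) - err_T(h)| \ge \alpha/3$ or $err_T(h) \ge OPT + 2\alpha/3$.
Thus $\Pr[E] \le |\mathcal{H}_d|(2\exp(-2n\alpha^2/9) + \exp(-\epsilon n \alpha/6)) \le \beta$ where the last inequality holds for $n \ge
6\left(\left(\ln|\mathcal{H}_d| + \ln\frac{1}{\beta}\right) \cdot \max\left\{\frac{1}{\epsilon\alpha}, \frac{1}{\alpha^2}\right\}\right)$. □

³ We discuss the relationship to the "compression version" of Occam's razor at the end of this section.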
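
To get a feel for the bound, the sample size stated at the end of the proof can be evaluated numerically. The helper below is a hypothetical convenience function (its name and signature are ours); it simply computes the expression $6(\ln|\mathcal{H}_d| + \ln(1/\beta)) \cdot \max\{1/(\epsilon\alpha), 1/\alpha^2\}$ and rounds up.

```python
import math

def generic_private_sample_bound(num_hypotheses, eps, alpha, beta):
    """Evaluate 6 * (ln|H_d| + ln(1/beta)) * max(1/(eps*alpha), 1/alpha^2),
    the sample size stated in the proof, rounded up to an integer."""
    log_term = math.log(num_hypotheses) + math.log(1.0 / beta)
    return math.ceil(6 * log_term * max(1.0 / (eps * alpha), 1.0 / alpha ** 2))

# With eps >= alpha the privacy term does not dominate, so the value equals
# the one obtained from the 1/alpha^2 term alone.
n_loose_privacy = generic_private_sample_bound(1024, eps=1.0, alpha=0.25, beta=0.05)
n_tight_privacy = generic_private_sample_bound(1024, eps=0.125, alpha=0.25, beta=0.05)
```

For $\epsilon < \alpha$ the value grows by the factor $\alpha/\epsilon$, which is the gap to the non-private bound discussed in the remark that follows.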

Remark: In the non-private agnostic case, the standard Occam's razor bound guarantees that $O((\log|\mathcal{C}_d| +
\log(1/\beta))/\alpha^2)$ labeled examples suffice to agnostically learn a concept class $\mathcal{C}_d$. The bound of Theorem 3.4
differs by a factor of $O(\frac{\alpha}{\epsilon})$ if $\alpha > \epsilon$, and does not differ at all otherwise. For (non-agnostic) PAC learning,
the dependence on $\alpha$ in the sample size for both the private and non-private versions improves to $1/\alpha$. In
that case the upper bounds for private and non-private learners differ by a factor of $O(1/\epsilon)$. Finally, the
theorem can be extended to settings where $\mathcal{H}_d \neq \mathcal{C}_d$, but in this case using the same sample complexity the
learner outputs a hypothesis whose error is close to the best error attainable by a function in $\mathcal{H}_d$.



Implications of the Private Agnostic Learner. The private agnostic learner has the following important
consequence: If some concept class $\mathcal{C}_d$ is learnable by any algorithm $\mathcal{A}$, not necessarily a private one,
and $\mathcal{A}$'s output length in bits is polynomially bounded, then there is a (possibly exponential time) private
algorithm that learns $\mathcal{C}_d$ using a polynomial number of samples. Since $\mathcal{A}$'s output is polynomially long, $\mathcal{A}$'s
hypothesis class $\mathcal{H}_d$ must have size at most $2^{poly(d)}$. Since $\mathcal{A}$ learns $\mathcal{C}_d$ using $\mathcal{H}_d$, class $\mathcal{H}_d$ must contain a
good hypothesis. Thus, our private learner will learn $\mathcal{C}_d$ using $\mathcal{H}_d$ with sample complexity linear in $\log|\mathcal{H}_d|$.



The "compression version" of Occam's razor It is most natural to state our result as an analogue of 
the cardinality version of Occam's razor, which bounds generalization error in terms of the size of the 
hypothesis class. However, our result can be extended to the compression version, which captures the 
general relationship between compression and learning (we borrow the "cardinality version" terminology 
from [42]). This latter version states that any algorithm which "compresses" the data set, in the sense that it
finds a consistent hypothesis which has a short description relative to the number of samples seen so far, is
a good learner (see [15] and [42, p. 34]).

Compression by itself does not imply privacy, because the compression algorithm's output might encode 
a few examples in the clear (for example, the hyperplane output by a support vector machine is defined 
via a small number of actual data points). However, Theorem 3.4 can be extended to provide a private 
analogue of the compression version of Occam's razor. If there exists an algorithm that compresses, in 
the sense above, then there also exists a private PAC learner which does not have fixed sample complexity, 
but uses an expected number of samples similar to that of the compression algorithm. The private learner 
proceeds in rounds: at each round it requests twice as many examples as in the previous round, and uses a
restricted hypothesis class consisting of sufficiently concise hypotheses from the original class $\mathcal{H}$. We omit
the straightforward details.



3.2 Private Learning with VC dimension Sample Bounds 

In the non-private case one can also bound the sample size of a PAC learner in terms of the Vapnik- 
Chervonenkis (VC) dimension of the concept class. 

Definition 3.5 (VC dimension). A set $S \subseteq X_d$ is shattered by a concept class $\mathcal{C}_d$ if $\mathcal{C}_d$ restricted to $S$
contains all $2^{|S|}$ possible functions from $S$ to $\{0, 1\}$. The VC dimension of $\mathcal{C}_d$, denoted $VCDIM(\mathcal{C}_d)$, is
the cardinality of a largest set $S$ shattered by $\mathcal{C}_d$.
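
For small finite classes, the definition can be checked by brute force. The following sketch is only meant to make the definition concrete; the function names and the two toy classes (thresholds and intervals over a six-point domain) are illustrative assumptions.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if the concepts, restricted to `points`, realize all 2^|points| labelings."""
    labelings = {tuple(c(x) for x in points) for c in concepts}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Brute-force VC dimension of a finite class over a finite domain."""
    dim = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(s, concepts) for s in combinations(domain, size)):
            dim = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop here
    return dim

domain = list(range(6))
thresholds = [lambda x, t=t: int(x >= t) for t in range(7)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a in range(6) for b in range(a, 6)]
intervals.append(lambda x: 0)  # the empty interval
```

Thresholds cannot label a smaller point 1 and a larger point 0, and intervals cannot realize the labeling (1, 0, 1) on three points, which is what the exhaustive check detects.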



We can extend Theorem 3.4 to classes with finite VC dimension, but the resulting sample complexity 
also depends logarithmically on the size of the domain from which examples are drawn. Recent results 
of Beimel et al. [8] show that for "proper" learning, the dependency is in fact necessary; that is, the VC
dimension alone is not sufficient to bound the sample complexity of proper private learning. It is unclear if 
the dependency is necessary in general. 

Corollary 3.6. Every concept class $\mathcal{C}_d$ is privately agnostically learnable using hypothesis class $\mathcal{H}_d = \mathcal{C}_d$
with $n = O\left(\left(VCDIM(\mathcal{C}_d) \cdot \ln|X_d| + \ln\frac{1}{\beta}\right) \cdot \max\left\{\frac{1}{\epsilon\alpha}, \frac{1}{\alpha^2}\right\}\right)$ labeled examples from $\mathcal{D}$. Here, $\epsilon$, $\alpha$, and $\beta$
are parameters of the private agnostic learner, and $VCDIM(\mathcal{C}_d)$ is the VC dimension of $\mathcal{C}_d$. (The learner
is not necessarily efficient.)

Proof. Sauer's lemma (see, e.g., [42]) implies that there are $O(|X_d|^{VCDIM(\mathcal{C}_d)})$ different labelings of $X_d$
by functions in $\mathcal{C}_d$. We can thus run the generic learner of the previous section with a hypothesis class of
size $|\mathcal{H}_d| = O(|X_d|^{VCDIM(\mathcal{C}_d)})$. The statement follows directly. □

Our original proof of the corollary used a result of Blum, Ligett and Roth [14] (which was inspired, in
turn, by our generic learning algorithm) on generating synthetic data. The simpler proof above was pointed 
out to us by an anonymous reviewer. 

Remark: Computability Issues with Generic Learners. In their full generality, the generic learning
results of the previous sections (Theorem 3.4 and Corollary 3.6) produce well-defined randomized maps, but not
necessarily "algorithms" in the sense of "functions uniformly computable by Turing machines". This is because
the concept class and example domain may themselves not be computable (nor even recognizable) uniformly
(imagine, for example, a concept class indexed by elements of the halting problem). It is commonly assumed
in the learning literature that elements of the concept class and domain can be computed/recognized by a
Turing machine and some bound on the length of their binary representations is known. In this case, the
generic learners can be implemented by randomized Turing machines with finite expected running time.



4 An Efficient Private Learner for PARITY 

Let PARITY be the class of parity functions $c_r : \{0, 1\}^d \to \{0, 1\}$ indexed by $r \in \{0, 1\}^d$, where $c_r(x) =
r \odot x$ denotes the inner product modulo 2. In this section, we present an efficient private PAC learning
algorithm for PARITY. The main result is stated in Theorem 4.4.



The standard (non-private) PAC learner for PARITY [33, 30] looks for the hidden vector $r$ by solving a
system of linear equations imposed by examples $(x_i, c_r(x_i))$ that the algorithm sees. It outputs an arbitrary
vector consistent with the examples, i.e., in the solution space of the system of linear equations. We want
to design a private algorithm that emulates this behavior. A major difficulty is that the private learner's
behavior must be specified on all databases $z$, even those which are not consistent with any single parity
function. The standard PAC learner would simply fail in such a situation (we denote failure by the output
$\perp$). In contrast, the probability that a private algorithm fails must be similar for all neighbors $z$ and $z'$.

We first present a private algorithm $\mathcal{A}$ for learning PARITY that succeeds only with constant probability.
Later we amplify its success probability and get a private PAC learner $\mathcal{A}^*$ for PARITY. Intuitively, the reason
PARITY can be learned privately is that when a new example (corresponding to a new linear constraint) is
added, the space of consistent hypotheses shrinks by at most a factor of 2. This holds unless the new
constraint is inconsistent with previous constraints. In the latter case, the size of the space of consistent
hypotheses goes to 0. Thus, the solution space changes drastically on neighboring inputs only when the
algorithm fails (outputs $\perp$). The fact that the algorithm outputs $\perp$ on a database $z$ and a valid (non-$\perp$) hypothesis
on a neighboring database $z'$ might lead to privacy violations. To avoid this, our algorithm always outputs
$\perp$ with probability at least 1/2 on any input (Step 1).

A PRIVATE LEARNER FOR PARITY, $\mathcal{A}(z, \epsilon)$

1. With probability 1/2, output $\perp$ and terminate.

2. Construct a set $S$ by picking each element of $[n]$ independently with probability $p = \epsilon/4$.

3. Use Gaussian elimination to solve the system of equations imposed by examples, indexed by $S$:
namely, $\{x_i \odot r = c_r(x_i) : i \in S\}$. Let $V_S$ denote the resulting affine subspace.

4. Pick $r^* \in V_S$ uniformly at random and output $c_{r^*}$; if $V_S = \emptyset$, output $\perp$.
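
The four steps above can be sketched as follows. This is an illustrative sketch, not the authors' code: vectors in $\{0,1\}^d$ are packed into Python integers (bit $j$ is coordinate $j$), the failure symbol $\perp$ is represented by `None`, and the names `parity`, `solve_gf2`, and `private_parity_learner` are our own.

```python
import random

def parity(r, x):
    """c_r(x): inner product of r and x modulo 2, with vectors packed into ints."""
    return bin(r & x).count("1") % 2

def solve_gf2(equations, d):
    """Gaussian elimination over GF(2) for equations <x, r> = y.

    Returns (particular_solution, nullspace_basis), or None if inconsistent.
    """
    pivots = {}  # pivot bit -> (row, rhs); each pivot bit occurs in exactly one row
    for x, y in equations:
        for bit, (row, rhs) in pivots.items():
            if (x >> bit) & 1:        # cancel known pivot bits from the new equation
                x, y = x ^ row, y ^ rhs
        if x == 0:
            if y == 1:
                return None           # the equation reduced to 0 = 1
            continue                  # redundant equation
        bit = x.bit_length() - 1
        for b, (row, rhs) in list(pivots.items()):
            if (row >> bit) & 1:      # keep existing rows free of the new pivot bit
                pivots[b] = (row ^ x, rhs ^ y)
        pivots[bit] = (x, y)
    particular = sum(rhs << bit for bit, (row, rhs) in pivots.items())
    basis = []
    for free in (f for f in range(d) if f not in pivots):
        v = 1 << free
        for bit, (row, rhs) in pivots.items():
            if (row >> free) & 1:
                v |= 1 << bit
        basis.append(v)
    return particular, basis

def private_parity_learner(z, d, eps, rng=random):
    """One run of A(z, eps); returns r* as an int, or None for the failure output."""
    if rng.random() < 0.5:                         # Step 1
        return None
    p = eps / 4
    sample = [ex for ex in z if rng.random() < p]  # Step 2
    solved = solve_gf2(sample, d)                  # Step 3
    if solved is None:
        return None                                # V_S is empty
    r, basis = solved
    for v in basis:                                # Step 4: uniform element of V_S
        if rng.random() < 0.5:
            r ^= v
    return r

secret = 0b101
z = [(x, parity(secret, x)) for x in range(1, 8)] * 10
print(private_parity_learner(z, d=3, eps=0.5, rng=random.Random(1)))
```

A uniform element of the affine subspace $V_S$ is obtained by adding a uniformly random combination of null-space basis vectors to one particular solution.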



The proof of A's utility follows by considering all the possible situations in which the algorithm fails to 
satisfy the error bound, and by bounding the probabilities with which these situations occur. 

Lemma 4.1 (Utility of $\mathcal{A}$). Let $\mathcal{X}$ be a distribution over $X = \{0, 1\}^d$. Let $z = (z_1, \ldots, z_n)$, where for all
$i \in [n]$, the entry $z_i = (x_i, c(x_i))$ with $x_i$ drawn i.i.d. from $\mathcal{X}$ and $c \in$ PARITY. If $n \ge \frac{8}{\epsilon\alpha}(d \ln 2 + \ln 4)$
then

$\Pr[\mathcal{A}(z, \epsilon) = h \text{ with } err(h) \le \alpha] \ge \frac{1}{4}.$

Proof. By standard arguments in learning theory [42], $|S| \ge \frac{1}{\alpha}\left(d \ln 2 + \ln\frac{1}{\beta}\right)$ labeled examples are
sufficient for learning PARITY with error $\alpha$ and failure probability $\beta$. Since $\mathcal{A}$ adds each element of $[n]$ to
$S$ independently with probability $p = \epsilon/4$, the expected size of $S$ is $pn = \epsilon n/4$. By the Chernoff bound
(Theorem A.1), $|S| \ge \epsilon n/8$ with probability at least $1 - e^{-\epsilon n/16}$. We set $\beta = \frac{1}{4}$ and pick $n$ such that
$\epsilon n/8 \ge \frac{1}{\alpha}(d \ln 2 + \ln 4)$.

We now bound the overall success probability. $\mathcal{A}(z, \epsilon) = h$ with $err(h) \le \alpha$ unless one of the following
bad events happens: (i) $\mathcal{A}$ terminates in Step 1, (ii) $\mathcal{A}$ proceeds to Step 2, but does not get enough examples:
$|S| < \frac{1}{\alpha}(d \ln 2 + \ln 4)$, (iii) $\mathcal{A}$ gets enough examples, but outputs a hypothesis with error greater than $\alpha$.
The first bad event occurs with probability 1/2. If the lower bound on the database size $n$ is satisfied then
the second bad event occurs with probability at most $e^{-\epsilon n/16}/2 \le 1/8$. The last inequality follows from the
bound on $n$ and the fact that $\alpha \le 1/2$. Finally, by our choice of parameters, the last bad event occurs with
probability at most $\beta/2 = 1/8$. The claimed bound on the success probability follows. □

Lemma 4.2 (Privacy of $\mathcal{A}$). Algorithm $\mathcal{A}$ is $\epsilon$-differentially private.

As mentioned above, the key observation in the following proof is that including any single point in
the sample set $S$ increases the probability of a hypothesis being output by at most a factor of 2.

Proof. To show that $\mathcal{A}$ is $\epsilon$-differentially private, it suffices to prove that any output of $\mathcal{A}$, either a valid
hypothesis or $\perp$, appears with roughly the same probability on neighboring databases $z$ and $z'$. In the
remainder of the proof we fix $\epsilon$, and write $\mathcal{A}(z)$ as shorthand for $\mathcal{A}(z, \epsilon)$. We have to show that

$\Pr[\mathcal{A}(z) = h] \le e^\epsilon \cdot \Pr[\mathcal{A}(z') = h]$ for all neighbors $z, z' \in D^n$ and all hypotheses $h \in$ PARITY; (2)
$\Pr[\mathcal{A}(z) = \perp] \le e^\epsilon \cdot \Pr[\mathcal{A}(z') = \perp]$ for all neighbors $z, z' \in D^n$. (3)

We prove the correctness of Eqn. (2) first. Let $z$ and $z'$ be neighboring databases, and let $i$ denote the entry
on which they differ. Recall that $\mathcal{A}$ adds $i$ to $S$ with probability $p$. Since $z$ and $z'$ differ only in the $i$-th entry,
$\Pr[\mathcal{A}(z) = h \mid i \notin S] = \Pr[\mathcal{A}(z') = h \mid i \notin S]$.

Note that if $\Pr[\mathcal{A}(z') = h \mid i \notin S] = 0$, then also $\Pr[\mathcal{A}(z) = h \mid i \notin S] = 0$, and hence $\Pr[\mathcal{A}(z) = h] = 0$
because adding a constraint does not add new vectors to the space of solutions. Otherwise, $\Pr[\mathcal{A}(z') =
h \mid i \notin S] > 0$. In this case, we rewrite the probability on $z$ as follows:

$\Pr[\mathcal{A}(z) = h] = p \cdot \Pr[\mathcal{A}(z) = h \mid i \in S] + (1 - p) \cdot \Pr[\mathcal{A}(z) = h \mid i \notin S],$

and apply the same transformation to the probability on $z'$. Then

$\frac{\Pr[\mathcal{A}(z) = h]}{\Pr[\mathcal{A}(z') = h]} = \frac{p \cdot \Pr[\mathcal{A}(z) = h \mid i \in S] + (1 - p) \cdot \Pr[\mathcal{A}(z) = h \mid i \notin S]}{p \cdot \Pr[\mathcal{A}(z') = h \mid i \in S] + (1 - p) \cdot \Pr[\mathcal{A}(z') = h \mid i \notin S]}$

$\le \frac{p \cdot \Pr[\mathcal{A}(z) = h \mid i \in S] + (1 - p) \cdot \Pr[\mathcal{A}(z) = h \mid i \notin S]}{p \cdot 0 + (1 - p) \cdot \Pr[\mathcal{A}(z') = h \mid i \notin S]}$

$= \frac{p}{1 - p} \cdot \frac{\Pr[\mathcal{A}(z) = h \mid i \in S]}{\Pr[\mathcal{A}(z) = h \mid i \notin S]} + 1.$ (4)

We need the following claim:

Claim 4.3. $\frac{\Pr[\mathcal{A}(z) = h \mid i \in S]}{\Pr[\mathcal{A}(z) = h \mid i \notin S]} \le 2$ for all $z \in D^n$ and all hypotheses $h \in$ PARITY.

This claim is proved below. For now, we can plug it into Eqn. (4) to get

$\frac{\Pr[\mathcal{A}(z) = h]}{\Pr[\mathcal{A}(z') = h]} \le \frac{2p}{1 - p} + 1 \le \epsilon + 1 \le e^\epsilon.$

The first inequality holds since $p = \epsilon/4$ and $\epsilon \le 1/2$. This establishes Eqn. (2). The proof of Eqn. (3) is
similar:

$\frac{\Pr[\mathcal{A}(z) = \perp]}{\Pr[\mathcal{A}(z') = \perp]} = \frac{p \cdot \Pr[\mathcal{A}(z) = \perp \mid i \in S] + (1 - p) \cdot \Pr[\mathcal{A}(z) = \perp \mid i \notin S]}{p \cdot \Pr[\mathcal{A}(z') = \perp \mid i \in S] + (1 - p) \cdot \Pr[\mathcal{A}(z') = \perp \mid i \notin S]}$

$\le \frac{p \cdot 1 + (1 - p) \cdot \Pr[\mathcal{A}(z) = \perp \mid i \notin S]}{p \cdot 0 + (1 - p) \cdot \Pr[\mathcal{A}(z') = \perp \mid i \notin S]}$

$= \frac{p}{(1 - p) \cdot \Pr[\mathcal{A}(z') = \perp \mid i \notin S]} + 1 \le \frac{2p}{1 - p} + 1 \le \epsilon + 1 \le e^\epsilon.$

In the last line, the first inequality follows from the fact that on any input, $\mathcal{A}$ outputs $\perp$ with probability at
least 1/2. This completes the proof of the lemma. □

We now prove Claim 4.3.



Proof of Claim 4.3. The left hand side

$\frac{\Pr[\mathcal{A}(z) = h \mid i \in S]}{\Pr[\mathcal{A}(z) = h \mid i \notin S]} = \frac{\sum_{T \subseteq [n] \setminus \{i\}} \Pr[\mathcal{A}(z) = h \mid S = T \cup \{i\}] \cdot \Pr[\mathcal{A} \text{ selects } T \text{ from } [n] \setminus \{i\}]}{\sum_{T \subseteq [n] \setminus \{i\}} \Pr[\mathcal{A}(z) = h \mid S = T] \cdot \Pr[\mathcal{A} \text{ selects } T \text{ from } [n] \setminus \{i\}]}.$

To prove the claim, it is enough to show that $\frac{\Pr[\mathcal{A}(z) = h \mid S = T \cup \{i\}]}{\Pr[\mathcal{A}(z) = h \mid S = T]} \le 2$ for each $T \subseteq [n] \setminus \{i\}$.

Recall that $V_S$ is the space of solutions to the system of linear equations $\{\langle x_i, r \rangle = c_r(x_i) : i \in S\}$. Recall
also that $\mathcal{A}$ picks $r^* \in V_S$ uniformly at random and outputs $h = c_{r^*}$. Therefore,

$\Pr[\mathcal{A}(z) = c_{r^*} \mid S] = \begin{cases} 1/|V_S| & \text{if } r^* \in V_S, \\ 0 & \text{otherwise.} \end{cases}$

If $\Pr[\mathcal{A}(z) = h \mid S = T] = 0$ then $\Pr[\mathcal{A}(z) = h \mid S = T \cup \{i\}] = 0$ because a new constraint does not add
new vectors to the space of solutions. If $\Pr[\mathcal{A}(z) = h \mid S = T \cup \{i\}] = 0$, the required inequality holds. If
neither of the two probabilities is 0,

$\frac{\Pr[\mathcal{A}(z) = h \mid S = T \cup \{i\}]}{\Pr[\mathcal{A}(z) = h \mid S = T]} = \frac{1/|V_{T \cup \{i\}}|}{1/|V_T|} = \frac{|V_T|}{|V_{T \cup \{i\}}|} \le 2.$

The last inequality holds because in $\mathbb{Z}_2$ (the finite field with 2 elements where arithmetic is performed
modulo 2), adding a consistent linear constraint either reduces the space of solutions by a factor of 2 (if the
constraint is linearly independent from $V_T$) or does not change the solution space (if it is linearly dependent
on the previous constraints). The constraint indexed by $i$ has to be consistent with constraints indexed by $T$,
since both probabilities are not 0. □
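
The three cases in this argument (independent, dependent, and inconsistent constraints) can be checked by exhaustive enumeration for a small dimension. The sketch below is only a sanity check on a hand-picked example with $d = 4$; the constraint values and the function name `solutions` are assumptions chosen for illustration.

```python
from itertools import product

def solutions(constraints, d):
    """All r in {0,1}^d with <x, r> = y (mod 2) for every (x, y) in constraints."""
    return [r for r in product((0, 1), repeat=d)
            if all(sum(a * b for a, b in zip(x, r)) % 2 == y for x, y in constraints)]

d = 4
T = [((1, 0, 1, 0), 1), ((0, 1, 1, 0), 0)]
independent = ((0, 0, 1, 1), 1)    # not in the span of the constraints in T
dependent = ((1, 1, 0, 0), 1)      # the sum of the two constraints in T
inconsistent = ((1, 1, 0, 0), 0)   # same left-hand side, contradictory label

sizes = {
    "T": len(solutions(T, d)),
    "T + independent": len(solutions(T + [independent], d)),
    "T + dependent": len(solutions(T + [dependent], d)),
    "T + inconsistent": len(solutions(T + [inconsistent], d)),
}
```

Only the inconsistent constraint changes the size of the solution space by more than a factor of 2, and in that case the learner outputs $\perp$.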

It remains to amplify the success probability of $\mathcal{A}$. To do so, we construct a private version of the
standard (non-private) algorithm for amplifying a learner's success probability. The standard amplification
algorithm generates a set of hypotheses by invoking $\mathcal{A}$ multiple times on independent examples, and then
outputs a hypothesis from the set with the least training error as evaluated on a fresh test set (see [42] for
details). Our private amplification algorithm differs from the standard algorithm only in the last step: it
adds Laplacian noise to the training error to obtain a private version of the error, and then uses the perturbed
training error instead of the true training error to select the best hypothesis from the set.⁴ Recall that
$Lap(\lambda)$ denotes the Laplace probability distribution with mean 0, standard deviation $\sqrt{2}\lambda$, and p.d.f. $f(x) =
\frac{1}{2\lambda} e^{-|x|/\lambda}$.



⁴ Alternatively, we could use the generic learner from Theorem 3.4 to select among the candidate hypotheses; the resulting
algorithm has the same asymptotic behavior as the algorithm we discuss here. We chose the algorithm that we felt was simplest.

AMPLIFIED PRIVATE PAC LEARNER FOR PARITY, $\mathcal{A}^*(z, \epsilon, \alpha, \beta)$

1. $\beta' \leftarrow \frac{\beta}{3}$; $\alpha' \leftarrow \frac{\alpha}{5}$; $k \leftarrow \lceil \log_{4/3}\frac{1}{\beta'} \rceil$; $n' \leftarrow \frac{c \cdot d}{\epsilon\alpha'}$; $s \leftarrow \frac{c' \cdot k}{\alpha'\epsilon} \log\frac{k}{\beta'}$ (where $c, c'$ are constants).

2. If $n < kn' + s$, stop and return "insufficient samples".

3. Divide $z = (z_1, \ldots, z_n)$ into two parts, training set $\bar{z} = (z_1, \ldots, z_{kn'})$ and test set $\hat{z} =
(z_{kn'+1}, \ldots, z_{kn'+s})$.

4. Divide $\bar{z}$ into $k$ equal parts each of size $n'$, let $\bar{z}_j = (z_{(j-1)n'+1}, \ldots, z_{jn'})$ for $j \in [k]$.

5. For $j \leftarrow 1$ to $k$
   $h_j \leftarrow \mathcal{A}(\bar{z}_j, \epsilon)$;
   set perturbed training error of $h_j$ to $\widetilde{err}_T(h_j) = \frac{|\{z_i \in \hat{z} : h_j(x_i) \neq c(x_i)\}|}{s} + Lap\left(\frac{k}{s\epsilon}\right)$.

6. Output $h^* = h_{j^*}$ where $j^* = \arg\min_{j \in [k]} \{\widetilde{err}_T(h_j)\}$.
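
The private selection in Steps 5 and 6 can be sketched as below, assuming the candidate hypotheses have already been produced by independent runs of $\mathcal{A}$. The noise scale $k/(s\epsilon)$ is the one implied by the privacy argument (sensitivity $1/s$, privacy budget $\epsilon/k$ per estimate); the function names and the toy hypotheses are hypothetical. Laplace noise is drawn as the difference of two exponential variables, which has the $Lap(\lambda)$ distribution.

```python
import random

def sample_laplace(scale, rng=random):
    """Draw from Lap(scale) as the difference of two Exp(1/scale) variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_select(hypotheses, test_set, eps, rng=random):
    """Return the index of the hypothesis with the smallest perturbed error on test_set."""
    k, s = len(hypotheses), len(test_set)
    noisy = []
    for h in hypotheses:
        err = sum(1 for x, y in test_set if h(x) != y) / s
        noisy.append(err + sample_laplace(k / (s * eps), rng))
    return min(range(k), key=noisy.__getitem__)

# Toy usage (hypothetical): one perfect candidate and one that is always wrong.
truth = lambda x: x % 2
test_set = [(x, truth(x)) for x in range(1000)]
candidates = [truth, lambda x: 1 - x % 2]
chosen = private_select(candidates, test_set, eps=1.0, rng=random.Random(0))
```

With 1000 test examples the noise has scale 0.002, far smaller than the gap between the two error rates, so the good candidate is selected.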



Theorem 4.4. Algorithm $\mathcal{A}^*$ efficiently and privately PAC learns PARITY (according to Definition 3.1) with
$O\left(\frac{d \log(1/\beta)}{\epsilon\alpha}\right)$ samples.

The theorem follows from Lemmas 4.5 and 4.6 that, respectively, prove privacy and utility of $\mathcal{A}^*$.



Lemma 4.5 (Privacy of $\mathcal{A}^*$). Algorithm $\mathcal{A}^*$ is $\epsilon$-differentially private.

Proof. We prove that even if $\mathcal{A}^*$ released all hypotheses $h_j$, computed in Step 5, together with the
corresponding perturbed error estimates $\widetilde{err}_T(h_j)$, it would still be $\epsilon$-differentially private. Since the output of $\mathcal{A}^*$
can be computed solely from this information, Claim 2.2 implies that $\mathcal{A}^*$ is $\epsilon$-differentially private.

By Lemma 4.2, algorithm $\mathcal{A}$ is $\epsilon$-differentially private. Since $\mathcal{A}$ is invoked on disjoint parts of $z$ to
compute hypotheses $h_j$, releasing all these hypotheses would also be $\epsilon$-differentially private.

Define the training error of hypothesis $h_j$ on $\hat{z}$ as $err_T(h_j) = |\{z_i \in \hat{z} : h_j(x_i) \neq c(x_i)\}|/s$. The
global sensitivity of the $err_T$ function is $1/s$ because $|err_T(\hat{z}) - err_T(\hat{z}')| \le 1/s$ for every pair of neighboring
databases $\hat{z}, \hat{z}'$. Therefore, by Theorem 2.3, releasing $\widetilde{err}_T(h_j)$ for one $j$ would be $\epsilon/k$-differentially private,
and by Claim 2.2, releasing all $k$ of them would be $\epsilon$-differentially private. Since hypotheses $h_j$ and their
perturbed errors $\widetilde{err}_T(h_j)$ are computed on disjoint parts of the database $z$, releasing all that information
would still be $\epsilon$-differentially private. □

Lemma 4.6 (Utility of $\mathcal{A}^*$). $\mathcal{A}^*(\cdot, \epsilon, \cdot, \cdot)$ PAC learns PARITY with sample complexity $n = O\left(\frac{d \log(1/\beta)}{\epsilon\alpha}\right)$.

Proof. Let $\mathcal{X}$ be a distribution over $X = \{0, 1\}^d$. Recall that $z = (z_1, \ldots, z_n)$, where for all $i \in [n]$,
the entry $z_i = (x_i, c(x_i))$ with $x_i$ drawn i.i.d. from $\mathcal{X}$ and $c \in$ PARITY. Assume that $\beta < 1/4$, and
$n \ge C \cdot \frac{d \log(1/\beta)}{\epsilon\alpha}$ for a constant $C$ to be determined. We wish to prove that $\Pr[err(h^*) \le \alpha] \ge 1 - \beta$, where
$h^*$ is the hypothesis output by $\mathcal{A}^*$.

Consider the set of candidate hypotheses $\{h_1, \ldots, h_k\}$ output by the invocations of $\mathcal{A}$ inside of $\mathcal{A}^*$. We
call a hypothesis $h$ good if $err(h) \le \frac{\alpha}{5} = \alpha'$. We call a hypothesis $h$ bad if $err(h) > \alpha = 5\alpha'$. Note that
good and bad refer to a hypothesis' true error rate on the underlying distribution.

We will show:

1. With probability at least $1 - \beta'$, one of the invocations of $\mathcal{A}$ outputs a good hypothesis.

2. Conditioned on any particular outcome $\{h_1, \ldots, h_k\}$ of the invocations of $\mathcal{A}$, with probability at least
$1 - \beta'$, both:

(a) Every good hypothesis $h_j$ in $\{h_1, \ldots, h_k\}$ has training error $err_T(h_j) \le 2\alpha'$.

(b) Every bad hypothesis $h_j$ in $\{h_1, \ldots, h_k\}$ has training error $err_T(h_j) \ge 4\alpha'$.

3. Conditioned on any particular hypotheses $\{h_1, \ldots, h_k\}$ and training errors $err_T(h_1), \ldots, err_T(h_k)$, with
probability at least $1 - \beta'$, for all $j$ simultaneously, $|\widetilde{err}_T(h_j) - err_T(h_j)| < \alpha'$.

Suppose the events described in the three claims above all occur. Then some good hypothesis has
perturbed training error less than $3\alpha'$, yet all bad hypotheses have perturbed training error greater than $3\alpha'$.
Thus, the hypothesis $h_{j^*}$ with minimal perturbed error $\widetilde{err}_T(h_{j^*})$ is not bad, that is, has true error at most
$\alpha$. By the claims above, the probability that all three events occur is at least $1 - 3\beta' = 1 - \beta$, and so the
lemma holds. We now prove the claims.

First, by the utility guarantee of $\mathcal{A}$, each invocation of $\mathcal{A}$ inside $\mathcal{A}^*$ outputs a good hypothesis with
probability at least $\frac{1}{4}$ as long as the constant $c \ge 8(\ln 2 + \ln 4)$ (since in that case $n'$, the size of each $\bar{z}_j$,
is large enough to apply Lemma 4.1). The $k$ invocations of the algorithm $\mathcal{A}$ are on independent samples,
so the probability that none of $h_1, \ldots, h_k$ is good is at most $\left(\frac{3}{4}\right)^k$. Setting $k \ge \log_{4/3}\frac{1}{\beta'}$ ensures that with
probability at least $1 - \beta'$, at least one of $h_1, \ldots, h_k$ has error at most $\alpha'$.

Second, fix a particular sequence of candidate hypotheses $h_1, \ldots, h_k$. For each $j$, the training error
$err_T(h_j)$ is the average of $s$ Bernoulli trials, each with success probability $err(h_j)$. (Crucially, the set
$\hat{z}$ on which the training error is evaluated is independent of the data $\bar{z}$ used to find the candidate
hypotheses.) To bound the training error, we apply the multiplicative Chernoff bound (Theorem A.1) with
$n = s$ and $p = err(h_j)$. Here, $p \le \alpha'$ if $h_j$ is good, and $p \ge 5\alpha'$ if $h_j$ is bad.

By the multiplicative Chernoff bound (Theorem A.1), if $s \ge \frac{c_1}{\alpha'} \ln\frac{k}{\beta'}$ (for appropriate constant $c_1$), then

$\Pr[err_T(h_j) \ge 2\alpha' \mid h_j \text{ is good}] \le \Pr[Binomial(s, \alpha') \ge 2\alpha' s] \le \frac{\beta'}{k}$, and

$\Pr[err_T(h_j) \le 4\alpha' \mid h_j \text{ is bad}] \le \Pr[Binomial(s, 5\alpha') \le 4\alpha' s] \le \frac{\beta'}{k}.$

By a union bound, all the training errors are (simultaneously) approximately correct, with probability at
least $1 - k \cdot \frac{\beta'}{k} = 1 - \beta'$.



Finally, we prove the third claim. Consider a particular candidate hypothesis $h_j$. If $s \ge \frac{c_2 k}{\alpha'\epsilon} \ln\frac{k}{\beta'}$ (for
appropriate constant $c_2$), then (by using the c.d.f.⁵ of the Laplacian distribution)

$\Pr[|\widetilde{err}_T(h_j) - err_T(h_j)| \ge \alpha'] = \Pr\left[\left|Lap\left(\frac{k}{s\epsilon}\right)\right| \ge \alpha'\right] \le \frac{\beta'}{k}.$

By a union bound, all $k$ perturbed estimates are within $\alpha'$ of their correct value with probability at least
$1 - k \cdot \frac{\beta'}{k} = 1 - \beta'$. This probability is taken over the choice of Laplacian noise, and so the bound holds
independently of the particular hypotheses or their training error estimates. □



Remark: In the non-private case $O((d + \ln(1/\beta))/\alpha)$ labels are sufficient for learning PARITY. Theorem 4.4
shows that the upper bounds on the sample size of private and non-private learners differ only by a factor of
$O(\ln(1/\beta)/\epsilon)$.

⁵ The cumulative distribution function of the Laplacian distribution $Lap(\lambda)$ is $F(x) = \frac{1}{2}\exp\left(\frac{x}{\lambda}\right)$ if $x < 0$ and $1 - \frac{1}{2}\exp\left(-\frac{x}{\lambda}\right)$
if $x \ge 0$.

5 Local Protocols and SQ learning



In this section, we relate private learning in the local model to the SQ model of Kearns [39]. We first define
the two models precisely. We then prove their equivalence (Section 5.1), and discuss the implications for
learning (Section 5.2). Finally, we define the concept class MASKED-PARITY and prove that it separates
interactive from noninteractive local learning (Section 5.3).



Local Model. We start by describing private computation in the local model. Informally, each individual
holds her private information locally, and hands it to the learner after randomizing it. This is modeled by
letting the local algorithm access each entry $z_i$ in the input database $z = (z_1, \ldots, z_n) \in D^n$ only via local
randomizers.

Definition 5.1 (Local Randomizer). An $\epsilon$-local randomizer $R : D \to W$ is an $\epsilon$-differentially private
algorithm that takes a database of size $n = 1$. That is, $\Pr[R(u) = w] \le e^\epsilon \Pr[R(u') = w]$ for all $u, u' \in D$
and all $w \in W$. The probability is taken over the coins of $R$ (but not over the choice of the input).

Note that since a local randomizer works on a data set of size 1, $u$ and $u'$ are neighbors for all $u, u' \in D$.
Thus, this definition is consistent with our previous definition of differential privacy.
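
A familiar instance of Definition 5.1 is binary randomized response, sketched below for $D = W = \{0, 1\}$. This particular randomizer is our choice of example and is not defined in this section; the function names are hypothetical. The second function gives the exact output probabilities so that the defining inequality can be checked directly.

```python
import math
import random

def randomized_response(bit, eps, rng=random):
    """Report `bit` truthfully with probability e^eps / (1 + e^eps), else flip it."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < keep else 1 - bit

def output_probability(bit, w, eps):
    """Pr[R(bit) = w] for the randomizer above."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return keep if w == bit else 1.0 - keep

eps = 0.5
worst_ratio = max(output_probability(u, w, eps) / output_probability(v, w, eps)
                  for u in (0, 1) for v in (0, 1) for w in (0, 1))
```

The worst-case ratio of output probabilities equals $e^\epsilon$ (up to floating-point error), so the randomizer meets the definition with equality.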

Definition 5.2 (LR Oracle). Let $z = (z_1, \ldots, z_n) \in D^n$ be a database. An LR oracle $LR_z(\cdot, \cdot)$ gets an
index $i \in [n]$ and an $\epsilon$-local randomizer $R$, and outputs a random value $w \in W$ chosen according to the
distribution $R(z_i)$. The distribution $R(z_i)$ depends only on the entry $z_i$ in $z$.

Definition 5.3 (Local algorithm). An algorithm is $\epsilon$-local if it accesses the database $z$ via the oracle $LR_z$
with the following restriction: for all $i \in [n]$, if $LR_z(i, R_1), \ldots, LR_z(i, R_k)$ are the algorithm's invocations
of $LR_z$ on index $i$, where each $R_j$ is an $\epsilon_j$-local randomizer, then $\epsilon_1 + \cdots + \epsilon_k \le \epsilon$.

Local algorithms that prepare all their queries to $LR_z$ before receiving any answers are called
noninteractive; otherwise, they are interactive.

By Claim 2.2, $\epsilon$-local algorithms are $\epsilon$-differentially private.

SQ Model. In the statistical query (SQ) model, algorithms access statistical properties of a distribution 
rather than individual examples. 

Definition 5.4 (SQ Oracle). Let $\mathcal{D}$ be a distribution over a domain $D$. An SQ oracle $SQ_\mathcal{D}$ takes as input a
function $g : D \to \{+1, -1\}$ and a tolerance parameter $\tau \in (0, 1)$; it outputs $v$ such that:

$|v - E_{u \sim \mathcal{D}}[g(u)]| \le \tau.$

The query function g does not have to be Boolean. Bshouty and Feldman iTTTl showed that given access 
to an SQ oracle which accepts only Boolean query functions, one can simulate an oracle that accepts
real-valued functions $g : D \to [-b, b]$, and outputs $E_{u \sim \mathcal{D}}[g(u)] \pm \tau$ using $O(\log(b/\tau))$ nonadaptive queries to
the SQ oracle and similar processing time.
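
Given i.i.d. examples, a statistical query with range $\{+1, -1\}$ can be answered by an empirical average. The sketch below illustrates this; the function names are ours, and the sample size uses the standard Hoeffding bound $2\exp(-n\tau^2/2)$ for averages of variables in $[-1, 1]$, which is our assumption about a suitable concentration inequality rather than a statement taken from this section.

```python
import math

def empirical_sq_oracle(samples):
    """Return a function that answers a query g by its average over `samples`."""
    def answer(g):
        return sum(g(u) for u in samples) / len(samples)
    return answer

def samples_needed(tau, delta):
    """Number of i.i.d. samples so that one query with range [-1, 1] is answered
    within tau with probability at least 1 - delta (Hoeffding's inequality)."""
    return math.ceil(2.0 / tau ** 2 * math.log(2.0 / delta))

# Toy usage (hypothetical data): labeled pairs (x, y) and the query
# g(x, y) = +1 if x == y else -1.
samples = [(0, 0), (1, 1), (1, 0), (0, 1)]
oracle = empirical_sq_oracle(samples)
agreement = oracle(lambda u: 1 if u[0] == u[1] else -1)
```

When several queries are asked of the same sample, the failure probabilities would have to be combined, for example with a union bound.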

Definition 5.5 (SQ algorithm). An SQ algorithm accesses the distribution $\mathcal{D}$ via the SQ oracle $SQ_\mathcal{D}$. SQ
algorithms that prepare all their queries to $SQ_\mathcal{D}$ before receiving any answers are called nonadaptive;
otherwise, they are called adaptive.

Note that we do not restrict $g(\cdot)$ to be efficiently computable. We will distinguish later those algorithms
that only make queries to efficiently computable functions $g(\cdot)$.

5.1 Equivalence of Local and SQ Models

Both the SQ and local models restrict algorithms to access inputs in a particular manner. There is a significant
difference though: an SQ oracle sees a distribution $\mathcal{D}$, whereas a local algorithm takes as input a fixed
(arbitrary) database $z$. Nevertheless, we show that if the entries of $z$ are chosen i.i.d. according to $\mathcal{D}$, then
the models are equivalent. Specifically, an algorithm in one model can simulate an algorithm in the other
model. Moreover, the expected query complexity is preserved up to polynomial factors. We first present
the simulation of SQ algorithms by local algorithms (Section 5.1.1). The simulation in the other direction is
more delicate and is presented in Section 5.1.2.



5.1.1 Simulation of SQ Algorithms by Local Algorithms 

Blum et al. [11] used the fact that sum queries can be answered privately with little noise to show that any
efficient SQ algorithm can be simulated privately and efficiently. We show that it can be simulated efficiently 
even by a local algorithm, albeit with slightly worse parameters. 



Let $g : D \to [-b, b]$ be the SQ query we want to simulate. By Theorem 2.3, since the global sensitivity
of $g$ is $2b$, the algorithm $R_g(u) = g(u) + \eta$ where $\eta \sim \mathrm{Lap}(2b/\epsilon)$ is an $\epsilon$-local randomizer. We construct
a local algorithm $A_g$ that, given $n$ and $\epsilon$, and access to a database $z$ via oracle $LR_z$, invokes $LR_z$ for every
$i \in [n]$ with the randomizer $R_g$ and outputs the average of the responses:

A LOCAL ALGORITHM $A_g(n, \epsilon, LR_z)$ THAT SIMULATES AN SQ QUERY $g : D \to [-b, b]$
1. Output $\frac{1}{n} \sum_{i=1}^{n} LR_z(i, R_g)$ where $R_g(u) = g(u) + \eta$ and $\eta \sim \mathrm{Lap}(2b/\epsilon)$.

Note that $A_g$ outputs $\left(\frac{1}{n} \sum_{i=1}^{n} g(z_i)\right) + \left(\frac{1}{n} \sum_{i=1}^{n} \eta_i\right)$, where the $\eta_i$ are i.i.d. from $\mathrm{Lap}(2b/\epsilon)$. This
algorithm is $\epsilon$-local (since it applies a single $\epsilon$-local randomizer to each entry of $z$), and therefore $\epsilon$-differentially
private. The following lemma shows that when the input database $z$ is large enough, $A_g$ simulates the desired
SQ query $g$ with small error probability.

Lemma 5.6. If, for a sufficiently large constant $c$, database $z$ has $n \ge c \cdot \frac{\log(1/\beta)\, b^2}{\epsilon^2 \tau^2}$ entries sampled i.i.d.
from a distribution $\mathcal{D}$ on $D$, then algorithm $A_g$ approximates $E_{u \sim \mathcal{D}}[g(u)]$ within additive error $\pm\tau$ with
probability at least $1 - \beta$.

Proof. Let $v = E_{u \sim \mathcal{D}}[g(u)]$ denote the true mean. By the Chernoff-Hoeffding bound for real-valued
variables (Theorem A.2),

$$\Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} g(u_i) - v\right| \ge \frac{\tau}{2}\right] \le 2 \exp\left(-\frac{\tau^2 n}{8 b^2}\right).$$

Therefore, in the absence of additive Laplacian random noise, $O\left(\frac{\log(1/\beta)\, b^2}{\tau^2}\right)$ examples are enough to
approximate $E_{u \sim \mathcal{D}}[g(u)]$ within additive error $\pm\tau/2$ with probability at least $1 - \beta/2$. (Note that the number of
examples is smaller than the lower bound on $n$ in the lemma by a factor of $O(\epsilon^{-2})$.)

The effect of the Laplace noise can also be bounded via a standard tail inequality: setting $\lambda = 2b/\epsilon$ in
Lemma A.3, we get that $O\left(\frac{\log(1/\beta)\, b^2}{\epsilon^2 \tau^2}\right)$ samples are sufficient to ensure that the average of the $\eta_i$'s lies outside
$[-\tau/2, \tau/2]$ with probability at most $\beta/2$. It follows that $A_g$ estimates $E_{u \sim \mathcal{D}}[g(u)]$ within additive error $\pm\tau$ with
probability at least $1 - \beta$. □

Simulation. Lemma 5.6 suggests a simple simulation of a nonadaptive (resp. adaptive) SQ algorithm by
a noninteractive (resp. interactive) local algorithm as follows. Assume the SQ algorithm makes at most $t$
queries to an SQ oracle $SQ_{\mathcal{D}}$. The local algorithm simulates each query $(g, \tau)$ by running $A_g(n', \epsilon, LR_z)$
with parameters $\beta' = \beta/t$ and $n' = c \cdot \frac{\log(t/\beta)\, b^2}{\epsilon^2 \tau^2}$ on a previously unused portion of the database $z$ containing
$n'$ entries.

Theorem 5.7 (Local simulation of SQ). Let $A_{SQ}$ be an SQ algorithm that makes at most $t$ queries to an
SQ oracle $SQ_{\mathcal{D}}$, each with tolerance at least $\tau$. The simulation above is $\epsilon$-differentially private. If, for a
sufficiently large constant $c$, database $z$ has $n \ge c \cdot \frac{t \log(t/\beta)\, b^2}{\epsilon^2 \tau^2}$ entries sampled i.i.d. from the distribution $\mathcal{D}$,
then the simulation above gives the same output as $A_{SQ}$ with probability at least $1 - \beta$.

Furthermore, the simulation is noninteractive if the original SQ algorithm $A_{SQ}$ is nonadaptive. The
simulation is efficient if $A_{SQ}$ is efficient.

Proof. Each query is simulated with a fresh portion of $z$, and hence privacy is preserved as each entry is
subjected to a single application of the $\epsilon$-local randomizer $R$. By the union bound, the probability of any
of the queries not being approximated within additive error $\tau$ is bounded by $\beta$. If $A_{SQ}$ is nonadaptive, all
queries to $LR_z$ can be prepared in advance. □
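The simulation of Theorem 5.7 can be sketched in a few lines of Python for the nonadaptive case. This is an illustrative sketch only: the helper names (`laplace`, `answer_query_locally`, `simulate_sq_algorithm`) are hypothetical, the database is split into equal slices rather than sized by the bound on $n'$, and the noise is added by a single process here, whereas in the local model each individual would apply $R_g$ to their own entry.

```python
import random

def laplace(scale, rng):
    """One draw from Lap(scale), as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def answer_query_locally(portion, g, b, eps, rng):
    """A_g: pass each entry once through R_g(u) = g(u) + Lap(2b/eps), then average."""
    noisy = [g(u) + laplace(2.0 * b / eps, rng) for u in portion]
    return sum(noisy) / len(noisy)

def simulate_sq_algorithm(z, queries, b, eps, rng):
    """Answer every query on its own, previously unused slice of the database z.

    Assumption for this sketch: equal-size slices; the theorem instead sizes
    each slice as n' = c * log(t/beta) * b^2 / (eps^2 * tau^2).
    """
    n_prime = len(z) // len(queries)
    return [
        answer_query_locally(z[k * n_prime:(k + 1) * n_prime], g, b, eps, rng)
        for k, g in enumerate(queries)
    ]
```

Because every entry is touched by exactly one randomizer, the privacy argument of the proof carries over unchanged; only the accuracy depends on the slice size.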



5.1.2 Simulation of Local Algorithms by SQ Algorithms 

Let $z$ be a database containing $n$ entries drawn i.i.d. from $\mathcal{D}$. Consider a local algorithm making $t$ queries to
$LR_z$. We show how to simulate any local randomizer invoked by this algorithm by using statistical queries
to $SQ_{\mathcal{D}}$. Consider one such randomizer $R : D \to W$ applied to database entry $z_i$. To simulate $R$ we need to
sample $w \in W$ with probability $p(w) = \Pr_{z_i \sim \mathcal{D}}[R(z_i) = w]$ taken over choice of $z_i \sim \mathcal{D}$ and random coins
of $R$. (For interactive algorithms, it is more complicated, as the outputs of different randomizers applied to
the same entry $z_i$ have to be correlated.)



A brief outline. The idea behind the simulation is to sample from a distribution $\tilde{p}(\cdot)$ that is within small
statistical distance of $p(\cdot)$. We start by applying $R$ to an arbitrary input (say, $0$) in the domain $D$ and obtaining
a sample $w \sim R(0)$. Let $q(w) = \Pr[R(0) = w]$ (where the probability is taken only over randomness in $R$).
Since $R$ is $\epsilon$-differentially private, $q(w)$ approximates $p(w)$ within a multiplicative factor of $e^{\epsilon}$. To sample
$w$ from $p(\cdot)$ we use the following rejection sampling algorithm: (i) sample $w$ according to $q(\cdot)$; (ii) with
probability $\frac{p(w)}{q(w)\, e^{\epsilon}}$, output $w$; (iii) with the remaining probability, repeat from (i).

To carry out this strategy, we must be able to estimate $p(w)$, which depends on the (unknown) distribution
$\mathcal{D}$, using only SQ queries. The rough idea is to express $p(w)$ as the expectation, taken over $z \sim \mathcal{D}$,
of the function $h(z) = \Pr[R(z) = w]$ (where the probability is taken only over the coins of $R$). We can
use $h$ as the basis of an SQ query. In fact, to get a sufficiently accurate approximation, we must rescale the
function $h$ somewhat, and keep careful track of the error introduced by the SQ oracle. We present the details
in the proof of the following lemma:

Lemma 5.8. Let $z$ be a database with entries drawn i.i.d. from a distribution $\mathcal{D}$. For every noninteractive
(resp. interactive) local algorithm $A$ making $t$ queries to $LR_z$, there exists a nonadaptive (resp.
adaptive) statistical query algorithm $B$ that in expectation makes $O(t \cdot e^{\epsilon})$ queries to $SQ_{\mathcal{D}}$ with accuracy
$\tau = \Theta(\beta / (e^{2\epsilon} t))$, such that the statistical difference between $B$'s and $A$'s output distributions is at most $\beta$.



Proof. We split the simulation over Claims 5.9 and 5.10. In the first claim we simulate noninteractive local
algorithms using nonadaptive SQ algorithms. In the second claim we simulate interactive local algorithms
using adaptive SQ algorithms.

Claim 5.9. For every noninteractive local algorithm $A$ making $t$ nonadaptive queries to $LR_z$, there exists
a nonadaptive statistical query algorithm $B$ that in expectation makes $O(t \cdot e^{\epsilon})$ queries to $SQ_{\mathcal{D}}$ with accuracy
$\tau = \Theta(\beta / (e^{2\epsilon} t))$, such that the statistical difference between $B$'s and $A$'s output distributions is at most $\beta$.

Proof. We show how to simulate an $\epsilon$-local randomizer $R$ using statistical queries to $SQ_{\mathcal{D}}$. Because the
local algorithm is noninteractive, we can assume without loss of generality that it accesses each entry $z_i$
only once. (Otherwise, one can combine different operators, used to access $z_i$, by combining their answers
into a vector.) Given $R : D \to W$, we want to sample $w \in W$ with probability:

$$p(w) = \Pr_{z_i \sim \mathcal{D}}[R(z_i) = w].$$

Two notes regarding our notation: (i) As $z_i$ is drawn i.i.d. from $\mathcal{D}$ we could omit the index $i$. We leave
the index $i$ in our notation to emphasize that we actually simulate the application of a local randomizer $R$ to
entry $i$. (ii) The semantics of $\Pr$ changes depending on whether it appears with the subscript $z_i \sim \mathcal{D}$ or not.
$\Pr_{z_i \sim \mathcal{D}}$ denotes probability that is taken over the choice of $z_i \sim \mathcal{D}$ and the randomness in $R$, whereas when
the subscript is dropped $z_i$ is fixed and the probability is taken only over the randomness in $R$. Using this
notation, $\Pr_{z_i \sim \mathcal{D}}[R(z_i) = w] = E_{z_i \sim \mathcal{D}} \Pr[R(z_i) = w]$.

We construct an algorithm $B_{R,\epsilon}$ that, given $t$, $\beta$, and access to the SQ oracle, outputs $w \in W$, such that
the statistical difference between the output probability distributions of $B_{R,\epsilon}$ and the simulated randomizer
$R$ is at most $\beta/t$. Because the local algorithm makes $t$ queries, the overall statistical distance between the
output distribution of the local algorithm and the distribution resulting from the simulation is at most $\beta$, as
desired.



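AN SQ ALGORITHM $B_{R,\epsilon}(t, \beta, SQ_{\mathcal{D}})$ THAT SIMULATES AN $\epsilon$-LOCAL RANDOMIZER $R : D \to W$.

1. Sample $w \sim R(0)$. Let $q(w) = \Pr[R(0) = w]$.

2. Define $g : D \to [-1, 1]$ by $g(z_i) = \frac{\Pr[R(z_i) = w] - q(w)}{q(w)(e^{\epsilon} - e^{-\epsilon})}$, and let $\tau = \frac{\beta}{3 e^{2\epsilon} t}$.

3. Query the SQ oracle $v = SQ_{\mathcal{D}}(g, \tau)$, and let $\tilde{p}(w) = v\, q(w)(e^{\epsilon} - e^{-\epsilon}) + q(w)$.

4. With probability $\frac{\tilde{p}(w)}{q(w)(1 + \phi) e^{\epsilon}}$, where $\phi = \frac{\beta}{3t}$, output $w$.
   With the remaining probability, repeat from Step 1.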



We now show that the statistical distance between the output of $B_{R,\epsilon}(t, \beta, SQ_{\mathcal{D}})$ and the distribution $p(\cdot)$
is at most $\beta/t$. As mentioned above, our initial approximation $q(\cdot)$ of $p(\cdot)$ in Step 1 is obtained by applying
$R$ to some arbitrary input (namely, $0$) in the domain $D$ and sampling $w \sim R(0)$. Since $R$ is $\epsilon$-differentially
private, $q(w) = \Pr[R(0) = w]$ approximates $p(w)$ within a multiplicative factor of $e^{\epsilon}$.

However, to carry out the rejection sampling strategy, we need to get a much better estimate of $p(w)$.
Steps 2 and 3 compute such an estimate, $\tilde{p}(w)$, satisfying (with probability 1)

$$\tilde{p}(w) \in (1 \pm \phi)\, p(w) \quad \text{where } \phi = \frac{\beta}{3t}. \qquad (5)$$

We establish the inclusion (5) below. For now, assume it holds on every iteration. Step 4 is a rejection
sampling step which ensures that the output will follow a distribution close to $p(\cdot)$. Inclusion (5) guarantees
that $\frac{\tilde{p}(w)}{q(w)(1 + \phi) e^{\epsilon}}$ is at most 1, so the probability in Step 4 is well defined. The difficulty is that the quantity
$\tilde{p}(w)$ is not a well-defined function of $w$: it depends on the SQ oracle and may vary, for the same $w$, from
iteration to iteration.

Nevertheless, $\tilde{p}$ is fixed for any given iteration of the algorithm. In the given iteration, any particular
element $w$ gets output with probability $q(w) \times \frac{\tilde{p}(w)}{q(w)(1 + \phi) e^{\epsilon}} = \frac{\tilde{p}(w)}{(1 + \phi) e^{\epsilon}}$. The probability that the given iteration
terminates (i.e., outputs some $w$) is then $p_{\mathrm{terminate}} = \sum_{w} \frac{\tilde{p}(w)}{(1 + \phi) e^{\epsilon}}$. By (5), this probability is in $\frac{1 \pm \phi}{1 + \phi} \cdot e^{-\epsilon}$.
Thus, conditioned on the iteration terminating, element $w$ is output with probability

$$\frac{\tilde{p}(w)}{(1 + \phi) \cdot e^{\epsilon} \cdot p_{\mathrm{terminate}}} \in \frac{1 \pm \phi}{1 \mp \phi} \cdot p(w).$$

Since $\phi \le 1/3$, we can simplify this to get

$$\Pr[w \text{ output in a given iteration} \mid \text{iteration produces output}] \in (1 \pm 3\phi)\, p(w).$$

This implies that no matter which iteration produces output, the statistical difference between the distribution
of $w$ and $p(\cdot)$ will be at most $3\phi = \beta/t$, as desired.

Moreover, since each iteration terminates with probability at least $\frac{1 - \phi}{1 + \phi} \cdot e^{-\epsilon}$, the expected number of
iterations is at most $\frac{1 + \phi}{1 - \phi} \cdot e^{\epsilon} \le 2 e^{\epsilon}$. Thus, the total expected SQ query complexity of the simulation is
$O(t \cdot e^{\epsilon})$.

It remains to prove the correctness of (5). To estimate $p(w)$ given $w$, we set up the statistical query $g(z_i)$.
This is a valid query since $\Pr[R(z_i) = w]$ is a function of $z_i$, and furthermore $g(z_i) \in [-1, 1]$ for all $z_i$ as
$\Pr[R(z_i) = w] / \Pr[R(0) = w] \in e^{\pm\epsilon}$. The SQ query result $v$ lies within $E_{z_i \sim \mathcal{D}}[g(z_i)] \pm \tau$, where $\tau$ is the
tolerance parameter for the statistical query, and so

$$v \in \frac{E_{z_i \sim \mathcal{D}} \Pr[R(z_i) = w] - q(w)}{q(w)(e^{\epsilon} - e^{-\epsilon})} \pm \tau = \frac{p(w) - q(w)}{q(w)(e^{\epsilon} - e^{-\epsilon})} \pm \tau.$$

Plugging in the bounds for $v$ and $q(w)$ we get that $\tilde{p}(w) \in (1 \pm \tau')\, p(w)$ where $\tau' = e^{2\epsilon} \tau = \frac{\beta}{3t}$. This
establishes (5) and concludes the proof. □
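The algorithm $B_{R,\epsilon}$ can be exercised on a toy instance. The sketch below makes several assumptions that are not in the text: the domain is $\{0, 1\}$, the randomizer $R$ is binary randomized response (which is transparent, since its output probabilities have a closed form), and the SQ oracle is modeled as the exact expectation plus an arbitrary perturbation of magnitude at most $\tau$. All function names are hypothetical.

```python
import math
import random

def rr_prob(u, w, eps):
    """Pr[R(u) = w] for binary randomized response, an eps-local randomizer."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return keep if u == w else 1.0 - keep

def rr_sample(u, eps, rng):
    """Draw w ~ R(u)."""
    return u if rng.random() < rr_prob(u, u, eps) else 1 - u

def make_sq_oracle(dist, rng):
    """Toy SQ oracle: exact expectation under dist, perturbed by at most tau."""
    def oracle(g, tau):
        exact = sum(p * g(u) for u, p in dist.items())
        return exact + rng.uniform(-tau, tau)
    return oracle

def simulate_randomizer(oracle, eps, t, beta, rng):
    """One run of B_{R,eps}; returns (w, number of SQ queries used)."""
    phi = beta / (3.0 * t)
    tau = beta / (3.0 * math.exp(2.0 * eps) * t)
    gap = math.exp(eps) - math.exp(-eps)
    queries = 0
    while True:
        w = rr_sample(0, eps, rng)            # Step 1: w ~ R(0)
        q = rr_prob(0, w, eps)
        def g(u, w=w, q=q):                   # Step 2: rescaled query in [-1, 1]
            return (rr_prob(u, w, eps) - q) / (q * gap)
        v = oracle(g, tau)                    # Step 3: one statistical query
        queries += 1
        p_tilde = v * q * gap + q
        # Step 4: accept with probability p_tilde / (q (1 + phi) e^eps).
        if rng.random() < p_tilde / (q * (1.0 + phi) * math.exp(eps)):
            return w, queries
```

Running `simulate_randomizer` many times should give an empirical distribution of `w` close to $p(\cdot)$, with an average number of queries per run below $2e^{\epsilon}$, in line with the analysis above.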

Claim 5.10. For every interactive local algorithm $A$ making $t$ queries to $LR_z$, there exists an adaptive
statistical query algorithm $B$ that in expectation makes $O(t \cdot e^{\epsilon})$ queries to $SQ_{\mathcal{D}}$ with accuracy $\tau = \Theta(\beta / (e^{2\epsilon} t))$,
such that the statistical difference between $B$'s and $A$'s output distributions is at most $\beta$.

Proof. As in the previous claim, we show how to simulate the output of the local randomizers during the run
of the local algorithm. A difference, however, is that because an entry may be accessed multiple times, we
have to condition our sampling on the outcomes of previous (simulated) applications of local randomizers
to $z_i$.

More concretely, let $R_1, R_2, \ldots$ be the sequence of randomizers that access the entry $z_i$. To simulate
$R_k(z_i)$, we must take into account the answers $a_1, \ldots, a_{k-1}$ given by the simulations of $R_1(z_i), \ldots, R_{k-1}(z_i)$.
We show how to do this using adaptive statistical queries to $SQ_{\mathcal{D}}$. The notation is the same as in Claim 5.9.
We want to output $w \in W$ with probability

$$p(w) = \Pr_{z_i \sim \mathcal{D}}[R_k(z_i) = w \mid R_{k-1}(z_i) = a_{k-1}, R_{k-2}(z_i) = a_{k-2}, \ldots, R_1(z_i) = a_1],$$

where $R_j$ ($1 \le j \le k - 1$) denotes the $j$th randomizer applied to $z_i$.

As before, we start by sampling $w \sim R_k(0)$. Let $q(w) = \Pr[R_k(0) = w]$. Note that $q(w)$ approximates
$p(w)$ within a multiplicative factor of $e^{\epsilon}$ because $R_1, \ldots, R_k$ are respectively $\epsilon_1, \ldots, \epsilon_k$-differentially
private, and $\epsilon_1 + \cdots + \epsilon_k \le \epsilon$. Hence, we can use the rejection sampling algorithm as in Claim 5.9.
Rewrite $p(w)$:

$$p(w) = \frac{\Pr_{z_i \sim \mathcal{D}}[R_k(z_i) = w \wedge R_{k-1}(z_i) = a_{k-1} \wedge \cdots \wedge R_1(z_i) = a_1]}{\Pr_{z_i \sim \mathcal{D}}[R_{k-1}(z_i) = a_{k-1} \wedge \cdots \wedge R_1(z_i) = a_1]}
= \frac{E_{z_i \sim \mathcal{D}}\left[\Pr[R_k(z_i) = w \wedge R_{k-1}(z_i) = a_{k-1} \wedge \cdots \wedge R_1(z_i) = a_1]\right]}{E_{z_i \sim \mathcal{D}}\left[\Pr[R_{k-1}(z_i) = a_{k-1} \wedge \cdots \wedge R_1(z_i) = a_1]\right]}.$$



Conditioned on a particular value of $z_i$, the probabilities in the last expression depend only on the coins of
the randomizers. The outputs of the randomizers are independent conditioned on $z_i$, and therefore we can
simplify the expression above:

$$p(w) = \frac{E_{z_i \sim \mathcal{D}}\left[\Pr[R_k(z_i) = w] \cdot \prod_{j=1}^{k-1} \Pr[R_j(z_i) = a_j]\right]}{E_{z_i \sim \mathcal{D}}\left[\prod_{j=1}^{k-1} \Pr[R_j(z_i) = a_j]\right]}.$$

Let $p_1$ and $p_2$ denote the numerator and denominator, respectively, in the right hand side of the equation
above. Let $r_1(z_i)$ and $r_2(z_i)$ denote the values inside the expectations that define $p_1$ and $p_2$, respectively.
Namely,

$$r_1(z_i) = \Pr[R_k(z_i) = w] \cdot \prod_{j=1}^{k-1} \Pr[R_j(z_i) = a_j] \quad \text{and} \quad r_2(z_i) = \prod_{j=1}^{k-1} \Pr[R_j(z_i) = a_j].$$



For estimating $p_1 = E_{z_i \sim \mathcal{D}}[r_1(z_i)]$ we use the statistical query $g_1(z_i)$, and for estimating $p_2 = E_{z_i \sim \mathcal{D}}[r_2(z_i)]$
we use the statistical query $g_2(z_i)$, defined as follows:

$$g_1(z_i) = \frac{r_1(z_i) - r_1(0)}{r_1(0)(e^{\epsilon} - e^{-\epsilon})} \quad \text{and} \quad g_2(z_i) = \frac{r_2(z_i) - r_2(0)}{r_2(0)(e^{\epsilon} - e^{-\epsilon})}.$$

As in Claim 5.9, one can estimate $p_1$ and $p_2$ to within a multiplicative factor of $(1 \pm \tau')$ where $\tau' = e^{2\epsilon} \tau$
and $\tau$ is the accuracy of the statistical queries. The ratio of the estimates for $p_1$ and $p_2$ gives an estimate
$\tilde{p}(w)$ for $p(w)$ to within a multiplicative factor $(1 \pm 3\tau')$, for $\tau' \le \frac{1}{3}$. The estimate $\tilde{p}(w)$ can then be used
with rejection sampling to sample an output of the randomizer.

Let $t$ be the number of queries made by $A$. Setting $\tau'$ to a sufficiently small constant multiple of $\beta/t$
guarantees that the statistical difference between distributions $p$ and $\tilde{p}$ is at most $\beta/t$, and hence the
statistical difference between $B$'s and $A$'s output distributions is at most $\beta$. As in Claim 5.9, the expected
number of SQ queries for rejection sampling is $O(t \cdot e^{\epsilon})$. □



Claims 5.9 and 5.10 imply Lemma 5.8. □



Note that the efficiency of the constructions in Lemma 5.8 depends on the efficiency of computing the
functions submitted to the SQ oracle, e.g., the efficiency of computing the probability $\Pr[R(z_i) = w]$. We
discuss this issue in the next section.



5.2 Implications for Local Learning 

In this section, we define learning in the local and SQ models. The equivalence of the two models follows
from the simulations described in the previous sections. An immediate but important corollary is that local
learners are strictly less powerful than general private learners.

Definition 5.11 (Local Learning). Locally learnable is defined identically to privately PAC learnable
(Definition 3.1), except for the additional requirement that for all $\epsilon > 0$, algorithm $A(\epsilon, \cdot, \cdot, \cdot)$ is $\epsilon$-local and
invokes $LR_z$ at most $\mathrm{poly}(d, \mathrm{size}(c), 1/\epsilon, 1/\alpha, \log(1/\beta))$ times. Class $C$ is efficiently locally learnable if
both: (i) the running time of $A$ and (ii) the time to evaluate each query that $A$ makes are bounded by some
polynomial in $d$, $\mathrm{size}(c)$, $1/\epsilon$, $1/\alpha$, and $\log(1/\beta)$.

Let $\mathcal{X}$ be a distribution over an input domain $X$. Let $SQ_{c,\mathcal{X}}$ denote the statistical query oracle that takes
as input a function $g : X \times \{+1, -1\} \to \{+1, -1\}$ and a tolerance parameter $\tau \in (0,1)$ and outputs $v$ such
that: $\left| v - E_{x \sim \mathcal{X}}[g(x, c(x))] \right| \le \tau$.



Definition 5.12 (SQ Learning). SQ learnable is defined identically to PAC learnable (Definition 2.4), except
that instead of having access to examples $z$, an SQ learner $A$ can make $\mathrm{poly}(d, \mathrm{size}(c), 1/\alpha, \log(1/\beta))$
queries to oracle $SQ_{c,\mathcal{X}}$ with tolerance $\tau \ge 1/\mathrm{poly}(d, \mathrm{size}(c), 1/\alpha, \log(1/\beta))$. Class $C$ is efficiently SQ
learnable if both: (i) the running time of $A$ and (ii) the time to evaluate each query that $A$ makes are
bounded by some polynomial in $d$, $1/\alpha$, and $\log(1/\beta)$.

In order to state the equivalence between SQ and local learning, we require the following efficiency 
condition for a local randomizer. 

Definition 5.13 (Transparent Local Randomizer). Let $R : D \to W$ be an $\epsilon$-local randomizer. The randomizer
is transparent if both: (i) for all inputs $u \in D$, the time needed to evaluate $R$; and (ii) for all inputs
$u \in D$ and outputs $w \in W$, the time taken to compute the probability $\Pr[R(u) = w]$, are polynomially
bounded in the size of the input and $1/\epsilon$.

As stated, this definition requires exact computation of probabilities. This may not make sense on a
finite-precision machine, since for many natural randomizers the transition probabilities are irrational. One
can relax the requirement to insist that relevant probabilities are computable with additive error at most $\phi$ in
time polynomial in $\log(1/\phi)$.

All local protocols that have appeared in the literature |[29l [3ll2l[Tl l29ll45l[36ll are transparent, at least in 
this relaxed sense. 

In the equivalences of the previous sections, transparency of local randomizers corresponds directly to
efficient computability of the function $g$ in an SQ query. To see why, consider first the simulation of SQ
algorithms by local algorithms: if the original SQ algorithm is efficient (that is, query $g$ can be evaluated in
polynomial time) then the local randomizer $R(u) = g(u) + \eta$ can also be evaluated in polynomial time for
all $u \in D$. Furthermore, it is simple to estimate for all inputs $u \in D$ and outputs $w \in W$ the probability
$\Pr[R(u) = w]$ since $R(u)$ is a Laplacian random variable with known parameters. Second, in the SQ
simulation of a local algorithm, the functions $g(z_i) = \frac{\Pr[R(z_i) = w] - q(w)}{q(w)(e^{\epsilon} - e^{-\epsilon})}$ that are constructed can be evaluated
efficiently precisely when the local randomizers are transparent.

We can now state the main result of this section, which follows from Lemmas 5.6 and 5.8 along with 
the correspondence between transparent randomizers and efficient SQ queries. 



Theorem 5.14. Let $C$ be a concept class over $X$. Let $\mathcal{X}$ be a distribution over $X$. Let $z = (z_1, \ldots, z_n)$
denote a database where every $z_i = (x_i, c(x_i))$ with $x_i$ drawn i.i.d. from $\mathcal{X}$ and $c \in C$. Concept class $C$ is
locally learnable using $\mathcal{H}$ by an interactive local learner with inputs $\alpha$, $\beta$, and with access to $LR_z$ if and
only if $C$ is SQ learnable using $\mathcal{H}$ by an adaptive SQ learner with inputs $\alpha$, $\beta$, and access to $SQ_{c,\mathcal{X}}$.

Furthermore, the simulations guarantee the following additional properties: (i) an efficient SQ learner
is simulatable by an efficient local learner that uses only transparent randomizers; (ii) an efficient local
learner that uses only transparent randomizers is simulatable by an efficient SQ learner; (iii) a nonadaptive
SQ (resp. noninteractive local) learner is simulatable by a noninteractive local (resp. nonadaptive SQ)
learner.

Footnote 6. The standard definition of SQ learning does not allow for any probability of error in the learning
algorithm (that is, $\beta = 0$). Our definition allows for a small failure probability $\beta$. This enables cleaner
equivalence statements and clean modeling of randomized SQ algorithms. One can show that differentially private
algorithms must have some non-zero probability of error, so a relaxation along these lines is necessary for our results.

Now we can use lower bounds for SQ learners for PARITY (see, e.g., ll39l PT2l 1551 ) to demonstrate 
limitations of local learners. The lower bound of [12] rules out SQ learners for PARITY that use at most
$2^{d/3}$ queries of tolerance at least $2^{-d/3}$, even (a) allowing for unlimited computing time, (b) under the
restriction that examples be drawn from the uniform distribution and (c) allowing a small probability of
error (see Footnote 6). Since PARITY is (efficiently) privately learnable (Theorem 4.4), and since local
learning is equivalent to SQ learning, we obtain:

Corollary 5.15. Concept classes learnable by local learners are a strict subset of concept classes PAC 
learnable privately. This holds both with and without computational restrictions. 



5.3 The Power of Interaction in Local Protocols 

To complete the picture of locally learnable concept classes, we consider how interaction changes the power 
of local learners (and, equivalently, how adaptivity changes SQ learning). As mentioned in the introduction, 
interaction is very costly in typical applications of local algorithms. We show that this cost is sometimes nec- 
essary, by giving a concept class that an interactive algorithm can learn efficiently with a polynomial number 
of examples drawn from the uniform distribution, but for which any noninteractive algorithm requires an 
exponential number of examples under the same distribution. 

Let MASKED-PARITY be the class of functions $c_{r,a} : \{0,1\}^d \times \{0,1\}^{\log d} \times \{0,1\} \to \{+1, -1\}$
indexed by $r \in \{0,1\}^d$ and $a \in \{0,1\}$:

$$c_{r,a}(x, i, b) = \begin{cases} (-1)^{(r \odot x) + a} & \text{if } b = 0, \\ (-1)^{r_i} & \text{if } b = 1, \end{cases}$$

where $r \odot x$ denotes the inner product of $r$ and $x$ modulo 2, and $r_i$ is the $i$th bit of $r$. This concept class
divides the domain into two parts (according to the last bit, $b$). When $b = 0$, the concept $c_{r,a}$ behaves either
like the PARITY concept indexed by $r$, or like its negation, according to the bit $a$ (the "mask"). When $b = 1$,
the concept essentially ignores the input example and outputs some bit of the parity vector $r$.

Below, we consider the learnability of MASKED-PARITY $= \{c_{r,a}\}$ when the examples are drawn from
the uniform distribution over the domain $\{0,1\}^{d + \log d + 1}$. In Section 5.3.1, we give an adaptive SQ learner for
MASKED-PARITY under the uniform distribution. The adaptive learner uses two rounds of communication
with the SQ oracle: the first, to learn $r$ from the $b = 1$ half of the input, and the second, to retrieve the bit $a$
from the $b = 0$ half of the input via queries that depend on $r$.

In Section 5.3.2, we show that no nonadaptive SQ learner which uses $2^{o(d)}$ examples can consistently
produce a hypothesis that labels significantly more than 3/4 of the domain correctly. The intuition is that
as the queries are prepared nonadaptively, any information about $r$ gained from the $b = 1$ half of the inputs
cannot be used to prepare queries to the $b = 0$ half. Since information about $a$ is contained only in the
$b = 0$ half, in order to extract $a$, the SQ algorithm is forced to learn PARITY, which it cannot do with
few examples. Our separation in the SQ model directly translates to a separation in the local model (using
Theorem 5.14).

The following theorem summarizes our results. 

Theorem 5.16. 

1. There exists an efficient adaptive SQ learner for MASKED-PARITY over the uniform distribution. 

2. No nonadaptive SQ learner can learn MASKED-PARITY (with a polynomial number of queries)
even under the uniform distribution on examples. Specifically, there is an SQ oracle $O$ such that any
nonadaptive SQ learner that makes $t$ queries to $O$ over the uniform distribution, all with tolerance
at least $2^{-d/3}$, satisfies the following: if the concept $c_{\hat{r},\hat{a}}$ is drawn uniformly at random from the
set of MASKED-PARITY concepts, then, with probability at least $\frac{1}{2} - \frac{t}{2^{d/3 + 2}}$ over $c_{\hat{r},\hat{a}}$, the output
hypothesis $h$ of the learner has $\mathrm{err}(c_{\hat{r},\hat{a}}, h) \ge \frac{1}{4}$.

Corollary 5.17. The concept classes learnable by nonadaptive SQ learners (resp. noninteractive local
learners) under the uniform distribution are a strict subset of the concept classes learnable by adaptive 
SQ learners (resp. interactive local learners) under the uniform distribution. This holds both with and 
without computational restrictions. 



Weak vs. Strong Learning. The learning theory literature distinguishes between strong learning, in which 
the learning algorithm is required to produce hypotheses with arbitrarily low error (as in Definition 2.4,
where the parameter $\alpha$ can be arbitrarily small), and weak learning, in which the learner is only required
to produce a hypothesis with error bounded below 1/2 by a polynomially small margin. The separation 
proved in this section (Theorem 5.16) applies only to strong learning: although no nonadaptive SQ learner 
can produce a hypothesis with error much better than 1/4, it is simple to design a nonadaptive weak SQ 
learner for MASKED-PARITY under the uniform distribution with error exactly 1/4. 

In fact, it is impossible to obtain an analogue of our separation for weak learning. The characterization of 
SQ learnable classes in terms of "SQ dimension" by Blum et al. [12] implies that adaptive and nonadaptive
SQ algorithms are equivalent for weak learning. This is not explicit in [12], but follows from the fact that the
weak learner constructed for classes with low SQ dimension is non-adaptive. (Roughly, the learner works
by checking if the concept at hand is approximately equal to one of a polynomial number of alternatives; 
these alternatives depend on the input distribution and the concept class, but not on the particular concept at 
hand.) 



Distribution-free vs. Distribution-specific Learning. The results of this section concern the learnability
of MASKED-PARITY under the uniform distribution. The class MASKED-PARITY does not separate
adaptive from nonadaptive distribution-free learners, since MASKED-PARITY cannot be learned by
any SQ learner under the distribution which is uniform over examples with $b = 0$ (in that case, learning
MASKED-PARITY is equivalent to learning PARITY under the uniform distribution). Separating adaptive
from nonadaptive distribution-free SQ learning remains an open problem.



5.3.1 An Adaptive Strong SQ Learner for MASKED-PARITY over the Uniform Distribution 

Our adaptive learner for MASKED-PARITY uses two rounds of communication with the SQ oracle: first,
to learn $r$ from the $b = 1$ half of the input, and second, to retrieve the bit $a$ from the $b = 0$ half of the input
via queries that depend on $r$. Theorem 5.16, part (1), follows from the proposition below.

Adaptive SQ Learner $A_{MP}$ for MASKED-PARITY over the Uniform Distribution

1. For $j = 1, \ldots, d$ (in parallel)

   (a) Define $g_j : D \to \{0, 1\}$ by

       $g_j(x, i, b, y) = (i = j) \wedge (b = 1) \wedge (y = -1)$,

       where $x \in \{0,1\}^d$, $i \in \{0,1\}^{\log d}$, $b \in \{0,1\}$, and $y = c_{r,a}(x, i, b) \in \{+1, -1\}$.

   (b) $\mathrm{answer}_j \leftarrow SQ_{\mathcal{D}}(g_j, \tau)$, where $\tau = \frac{1}{4d+1}$, and $\hat{r}_j \leftarrow 1$ if $\mathrm{answer}_j > \frac{1}{4d}$; $0$ otherwise.

2. (a) $\hat{r} \leftarrow \hat{r}_1 \ldots \hat{r}_d \in \{0,1\}^d$

   (b) Define $g_{d+1} : D \to \{0, 1\}$ by

       $g_{d+1}(x, i, b, y) = (b = 0) \wedge (y \ne (-1)^{\hat{r} \odot x})$,

       where $x \in \{0,1\}^d$, $i \in \{0,1\}^{\log d}$, $b \in \{0,1\}$, and $y = c_{r,a}(x, i, b) \in \{+1, -1\}$.

   (c) $\mathrm{answer}_{d+1} \leftarrow SQ_{\mathcal{D}}(g_{d+1}, \tau')$ for a tolerance $\tau' < \frac{1}{4}$, and $\hat{a} \leftarrow 1$ if $\mathrm{answer}_{d+1} > \frac{1}{4}$; $0$ otherwise.

   (d) Output $c_{\hat{r},\hat{a}}$.

Proposition 5.18 (Theorem 5.16, part (1), in detail). The algorithm $A_{MP}$ efficiently learns MASKED-PARITY
(with probability 1) in 2 rounds using $d + 1$ SQ queries computed over the uniform distribution with minimum
tolerance $\frac{1}{4d+1}$.

Proof. Consider the $d$ queries in the first round. If $r_j = 1$, then

$$E_{(x,i,b,y) \leftarrow \mathcal{D}}[g_j(x, i, b, y)] = \Pr_{i \in_u \{0,1\}^{\log d},\; b \in_u \{0,1\}}[(i = j) \wedge (b = 1)] = \frac{1}{2d}.$$

If $r_j = 0$, then $E[g_j(x, i, b, y)] = 0$. Since the tolerance $\tau$ is less than $\frac{1}{4d}$, each query $g_j$ reveals the $j$th bit
of $r$ exactly. Thus, the estimate $\hat{r}_j$ is exactly $r_j$, and $\hat{r} = r$.

Given that $\hat{r}$ is correct, the second round query $g_{d+1}$ is always $0$ if $a = 0$. If $a = 1$, then $g_{d+1}$ is 1 exactly
when $b = 0$. Thus $E[g_{d+1}(x, i, b, y)] = \frac{a}{2}$ (where $a \in \{0, 1\}$). Since the tolerance is less than $\frac{1}{4}$, querying
$g_{d+1}$ reveals $a$: that is, $\hat{a} = a$, and so the algorithm outputs the target concept.

Note that the functions $g_1, \ldots, g_{d+1}$ are all computable in time $O(d)$, and the computations performed
by $A_{MP}$ can be done in time $O(d)$, so the SQ learner is efficient. □
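The two-round learner can be checked on a small instance. The following Python sketch rests on assumptions that are not part of the text: the SQ oracle is modeled by exact enumeration of the domain plus an arbitrary perturbation within the tolerance, the index $i$ is represented as an integer in `range(d)` rather than as a bit string, and the second-round tolerance is set to 0.2 (any value below 1/4 would serve). All function names are hypothetical.

```python
import itertools
import random

def concept(r, a):
    """c_{r,a}(x, i, b): masked parity when b = 0, the i-th bit of r when b = 1."""
    def c(x, i, b):
        if b == 0:
            parity = sum(rj * xj for rj, xj in zip(r, x)) % 2
            return (-1) ** ((parity + a) % 2)
        return (-1) ** r[i]
    return c

def make_oracle(c, d, rng):
    """Toy SQ oracle over the uniform distribution: exact mean, perturbed within tau."""
    domain = [(x, i, b) for x in itertools.product((0, 1), repeat=d)
              for i in range(d) for b in (0, 1)]
    def oracle(g, tau):
        exact = sum(g(x, i, b, c(x, i, b)) for x, i, b in domain) / len(domain)
        return exact + rng.uniform(-tau, tau)
    return oracle

def learn_masked_parity(oracle, d):
    tau = 1.0 / (4 * d + 1)
    # Round 1: d queries fixed in advance; none depends on another's answer.
    first_round = [lambda x, i, b, y, j=j: int(i == j and b == 1 and y == -1)
                   for j in range(d)]
    answers = [oracle(g, tau) for g in first_round]
    r_hat = tuple(1 if ans > 1.0 / (4 * d) else 0 for ans in answers)
    # Round 2: a single query that depends on r_hat.
    def g_last(x, i, b, y):
        parity = sum(rj * xj for rj, xj in zip(r_hat, x)) % 2
        return int(b == 0 and y != (-1) ** parity)
    a_hat = 1 if oracle(g_last, 0.2) > 0.25 else 0  # 0.2 is an assumed tolerance < 1/4
    return r_hat, a_hat
```

For $d = 4$ the domain has only 128 points, so one can run the learner against every concept in the class and confirm that it recovers $(r, a)$ each time.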

5.3.2 Impossibility of non-adaptive SQ learning for MASKED-PARITY 



The impossibility result (Theorem 5.16, part (2)) for nonadaptive learners uses ideas from statistical query



lower bounds (see, e.g., ll39l [1211531 ). 

Proof of Theorem 5.16, part (2). Recall that the distribution $\mathcal{D}$ is uniform over $D = \{0,1\}^{d + \log(d) + 1}$. For
functions $f, h : \{0,1\}^{d + \log d + 1} \to \{+1, -1\}$, recall that $\mathrm{err}(f, h) = \Pr_{x \sim \mathcal{D}}[f(x) \ne h(x)]$. Define the inner
product of $f$ and $h$ as:

$$\langle f, h \rangle = \frac{1}{|D|} \sum_{x \in D} f(x) h(x) = E_{x \sim \mathcal{D}}[f(x) h(x)].$$

The quantity $\langle f, h \rangle = \Pr_{x \sim \mathcal{D}}[f(x) = h(x)] - \Pr_{x \sim \mathcal{D}}[f(x) \ne h(x)] = 1 - 2 \cdot \mathrm{err}(f, h)$ measures the
correlation between $f$ and $h$ when $x$ is drawn from the uniform distribution $\mathcal{D}$.

Let the target function $c_{\hat{r},\hat{a}}$ be chosen uniformly at random from the set $\{c_{r,a}\}$. Consider a nonadaptive
SQ algorithm that makes $t$ queries $g_1, \ldots, g_t$. The queries $g_1, \ldots, g_t$ must be independent of $\hat{r}$ and $\hat{a}$ since
the learner is nonadaptive. The only information about $\hat{a}$ is in the outputs associated with the $b = 0$ half of
the inputs (recall that $c_{\hat{r},\hat{a}}(x, i, b) = (-1)^{\hat{r}_i}$ when $b = 1$).

The main technical part of the proof follows the lower bound on SQ learning of PARITY. Using Fourier
analysis, we split the true answer to a query into three components: a component that depends on the query
$g$ but not the pair $(\hat{r}, \hat{a})$, a component that depends on $g$ and $\hat{r}$ (but not $\hat{a}$), and a component that depends on
$g$, $\hat{r}$, and $\hat{a}$ (see Equation (7) below). We show that for most target concepts $c_{\hat{r},\hat{a}}$ the last component can be
ignored by the SQ oracle. That is, a very close approximation to the correct output to the SQ queries made
by the learner can be computed solely based on $g$ and $\hat{r}$. Consequently, for most target concepts $c_{\hat{r},\hat{a}}$, the SQ
oracle can return answers that are independent of $\hat{a}$, and hence $\hat{a}$ cannot be learned.

Consider a statistical query $g : \{0,1\}^d \times \{0,1\}^{\log d} \times \{0,1\} \times \{+1, -1\} \to \{+1, -1\}$. For some
$(x, i, b) \in D$, the value of $g(x, i, b, \cdot)$ depends on the label (i.e., $g(x, i, b, +1) \ne g(x, i, b, -1)$) and
otherwise $g(x, i, b, \cdot)$ is insensitive to the label (i.e., $g(x, i, b, +1) = g(x, i, b, -1)$). Every statistical
query $g(\cdot, \cdot, \cdot, \cdot)$ can be decomposed into a label-independent and label-dependent part. This fact was first
implicitly noted by Blum et al. [12] and made explicit by Bshouty and Feldman [17] (Lemma 30). We adapt
the proof presented in [17] for our purpose.

Let

$$f_g(x,i,b) = \frac{g(x,i,b,1) - g(x,i,b,-1)}{2} \quad\text{and}\quad C_g = \frac{1}{2}\,\mathbb{E}\big[g(x,i,b,1) + g(x,i,b,-1)\big].$$

We can rewrite the expectation of $g$ on any concept $c_{r,a}$ in terms of these quantities:

$$\mathbb{E}[g(x,i,b,c_{r,a}(x,i,b))] = C_g + \langle f_g, c_{r,a} \rangle.$$
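The decomposition can be checked by brute force on a toy instance. The following is a minimal sketch, assuming $d = 4$ and an arbitrary random query table; the names `masked_parity`, `expectation`, and `decompose` are ours, introduced only for illustration.

```python
import itertools
import random

D_BITS = 4  # toy value of d (assumption), small enough to enumerate the domain

# Domain {0,1}^d x [d] x {0,1}; the index i is stored as an integer.
DOMAIN = [(x, i, b)
          for x in itertools.product((0, 1), repeat=D_BITS)
          for i in range(D_BITS)
          for b in (0, 1)]

def masked_parity(r, a):
    """Return c_{r,a}: (-1)^(r.x + a) when b = 0 and (-1)^(r_i) when b = 1."""
    def c(z):
        x, i, b = z
        if b == 0:
            return (-1) ** ((sum(rj * xj for rj, xj in zip(r, x)) + a) % 2)
        return (-1) ** r[i]
    return c

def expectation(h):
    """E[h(z)] for z drawn uniformly from DOMAIN."""
    return sum(h(z) for z in DOMAIN) / len(DOMAIN)

def decompose(g):
    """Return (f_g, C_g) for a query g(z, label) with values in {-1, +1}."""
    f_g = lambda z: (g(z, 1) - g(z, -1)) / 2
    C_g = expectation(lambda z: g(z, 1) + g(z, -1)) / 2
    return f_g, C_g

# An arbitrary query: a fixed random table of +/-1 values (illustrative only).
rng = random.Random(0)
table = {(z, y): rng.choice((-1, 1)) for z in DOMAIN for y in (-1, 1)}
g = lambda z, y: table[(z, y)]

c = masked_parity(r=(1, 0, 1, 1), a=1)
f_g, C_g = decompose(g)
lhs = expectation(lambda z: g(z, c(z)))            # E[g(z, c(z))]
rhs = C_g + expectation(lambda z: f_g(z) * c(z))   # C_g + <f_g, c>
print(abs(lhs - rhs) < 1e-12)
```

The identity holds pointwise, since $g(z, y) = \frac{1}{2}[g(z,1) + g(z,-1)] + y \cdot f_g(z)$ for $y \in \{+1, -1\}$, so the check should succeed for any query table and any target.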

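Note that $C_g$ depends on the statistical query $g$, but not on the target function. We now wish to analyze
the second term, $\langle f_g, c_{r,a} \rangle$, more precisely. To this end, we define the following functions parameterized by
$s \in \{0,1\}$:

$$c^s_{r,a}(x,i,b) = \begin{cases} 0 & \text{if } b \neq s, \\ c_{r,a}(x,i,b) & \text{if } b = s, \end{cases} \quad\text{and}\quad f^s_g(x,i,b) = \begin{cases} 0 & \text{if } b \neq s, \\ f_g(x,i,b) & \text{if } b = s. \end{cases} \tag{6}$$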

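Recall that $\langle f_g, c_{r,a} \rangle$ is a sum over tuples $(x, i, b)$. We can separate the sum into two pieces: one with
tuples where $b = 0$ and the other with tuples where $b = 1$. Using the functions $c^s_{r,a}$ and $f^s_g$ just defined, we can
write $\langle f_g, c_{r,a} \rangle = \langle f^0_g, c^0_{r,a} \rangle + \langle f^1_g, c^1_{r,a} \rangle$. Hence,

$$\mathbb{E}[g(x,i,b,c_{r,a}(x,i,b))] = C_g + \langle f^0_g, c^0_{r,a} \rangle + \langle f^1_g, c^1_{r,a} \rangle. \tag{7}$$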

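The inner product $\langle f^1_g, c^1_{r,a} \rangle$ depends on the statistical query $g$ and on $r$, but not on $a$. Thus only the
middle term on the right-hand side of (7) depends on $a$.

Consider an SQ oracle $\mathcal{O} = \mathcal{O}_{r,a}$ that responds to every query $(g, \tau)$ as follows (recall that $\mathcal{D}$ is the
uniform distribution):

$$\mathcal{O}(g, \tau) = \begin{cases} C_g + \langle f^1_g, c^1_{r,a} \rangle & \text{if } |\langle f^0_g, c^0_{r,a} \rangle| < \tau, \\ \mathbb{E}[g(x,i,b,c_{r,a}(x,i,b))] & \text{otherwise.} \end{cases}$$

By Equation (7), both answers are within $\tau$ of the true expectation, so $\mathcal{O}$ is a valid SQ oracle.
If the condition $|\langle f^0_g, c^0_{r,a} \rangle| < \tau$ is met for all the queries $(g, \tau)$ made by the learner, then the SQ oracle $\mathcal{O}$
never replies with a quantity that depends on $a$. We now show that this is typically the case.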
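The oracle's behavior can be sketched in a few lines. This is only an illustration under stated assumptions: a toy dimension $d = 2$, queries passed as Python functions `g(z, label)`, and helper names (`split_terms`, `sq_oracle`, `true_expectation`) that are ours.

```python
import itertools

D_BITS = 2  # toy dimension d (assumption; the proof takes d large)
DOMAIN = [(x, i, b)
          for x in itertools.product((0, 1), repeat=D_BITS)
          for i in range(D_BITS)
          for b in (0, 1)]

def masked_parity(r, a):
    """Return c_{r,a} as a function of z = (x, i, b)."""
    def c(z):
        x, i, b = z
        if b == 0:
            return (-1) ** ((sum(rj * xj for rj, xj in zip(r, x)) + a) % 2)
        return (-1) ** r[i]
    return c

def split_terms(g, c):
    """Return (C_g, <f_g^0, c^0>, <f_g^1, c^1>), the three terms of Equation (7)."""
    n = len(DOMAIN)
    C_g = sum(g(z, 1) + g(z, -1) for z in DOMAIN) / (2 * n)
    halves = [0.0, 0.0]
    for z in DOMAIN:
        f_g = (g(z, 1) - g(z, -1)) / 2
        halves[z[2]] += f_g * c(z) / n   # z[2] is the bit b
    return C_g, halves[0], halves[1]

def sq_oracle(g, tau, c):
    """Drop the b = 0 term whenever it is smaller than the tolerance tau."""
    C_g, term_b0, term_b1 = split_terms(g, c)
    if abs(term_b0) < tau:
        return C_g + term_b1             # does not depend on a
    return C_g + term_b0 + term_b1       # exact expectation

def true_expectation(g, c):
    return sum(g(z, c(z)) for z in DOMAIN) / len(DOMAIN)

# A query that looks at the label only on the b = 1 half of the inputs.
g = lambda z, y: y if z[2] == 1 else 1
r = (1, 0)
answers = [sq_oracle(g, 0.1, masked_parity(r, a)) for a in (0, 1)]
print(answers[0] == answers[1])
```

For this particular query the $b = 0$ term is exactly zero, so the two answers coincide and reveal nothing about $a$.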
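Extend the definition of $c^s_{r,a}$ (Equation (6)) to any $(r, a) \in \{0,1\}^d \times \{0,1\}$ by defining

$$c^0_{r,a}(x,i,b) = \begin{cases} 0 & \text{if } b = 1, \\ c_{r,a}(x,i,b) = (-1)^{(r \odot x) + a} & \text{if } b = 0. \end{cases}$$

Note that for $r, r' \in \{0,1\}^d$ and $a \in \{0,1\}$,

$$\langle c^0_{r,a}, c^0_{r',a} \rangle = \begin{cases} 1/2 & \text{if } r = r', \\ 0 & \text{if } r \neq r'. \end{cases}$$

We get that $\{c^0_{r,0}\}_{r \in \{0,1\}^d}$ is an orthogonal set of functions, and similarly with $\{c^0_{r,1}\}_{r \in \{0,1\}^d}$. The $\ell_2$ norm
of $c^0_{r,0}$ is $\|c^0_{r,0}\| = \sqrt{\langle c^0_{r,0}, c^0_{r,0} \rangle} = 1/\sqrt{2}$, so the set $\{\sqrt{2} \cdot c^0_{r,0}\}_{r \in \{0,1\}^d}$ is orthonormal. A similar argument
holds for $\{\sqrt{2} \cdot c^0_{r,1}\}_{r \in \{0,1\}^d}$.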

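Expanding the function $f^0_g$ in the orthonormal set $\{\sqrt{2} \cdot c^0_{r,0}\}_{r \in \{0,1\}^d}$, we get:

$$\sum_{r \in \{0,1\}^d} \langle f^0_g, \sqrt{2} \cdot c^0_{r,0} \rangle^2 \le \|f^0_g\|^2 = \langle f^0_g, f^0_g \rangle \le 1/2.$$

(The first inequality is loose in general because the set $\{\sqrt{2} \cdot c^0_{r,0}\}_{r \in \{0,1\}^d}$ spans a subspace of dimension $2^d$
whereas $f^0_g$ is taken from a space of dimension $2^{d + \log d + 1}$.) Similarly,

$$\sum_{r \in \{0,1\}^d} \langle f^0_g, \sqrt{2} \cdot c^0_{r,1} \rangle^2 \le \|f^0_g\|^2 = \langle f^0_g, f^0_g \rangle \le 1/2.$$

Summing the two previous equations, we get

$$\sum_{(r,a) \in \{0,1\}^d \times \{0,1\}} 2 \cdot \langle f^0_g, c^0_{r,a} \rangle^2 \le 1.$$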
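The bound on the sum of squared correlations can be verified numerically on a small instance. The sketch below assumes $d = 4$ and samples $f^0_g$ directly as a random function with values in $\{-1, 0, 1\}$ on the $b = 0$ half (any such function arises from some query $g$); the helper names are ours.

```python
import itertools
import random

d = 4  # toy dimension (assumption)
XS = list(itertools.product((0, 1), repeat=d))
DOMAIN = [(x, i, b) for x in XS for i in range(d) for b in (0, 1)]

def c0(r, a, z):
    """c^0_{r,a}: the parity (-1)^(r.x + a) on b = 0, and 0 on b = 1."""
    x, _, b = z
    if b == 1:
        return 0
    return (-1) ** ((sum(rj * xj for rj, xj in zip(r, x)) + a) % 2)

rng = random.Random(1)
# f^0_g takes values in {-1, 0, 1} on the b = 0 half and vanishes on b = 1.
f0 = {z: (rng.choice((-1, 0, 1)) if z[2] == 0 else 0) for z in DOMAIN}

def corr(r, a):
    """<f^0_g, c^0_{r,a}> under the uniform distribution on DOMAIN."""
    return sum(f0[z] * c0(r, a, z) for z in DOMAIN) / len(DOMAIN)

corrs = {(r, a): corr(r, a) for r in XS for a in (0, 1)}
total = sum(2 * v * v for v in corrs.values())       # should be at most 1

tau = 2 ** (-d / 3)
heavy = sum(1 for v in corrs.values() if abs(v) >= tau)
max_heavy = 2 ** (2 * d / 3 - 1)                      # counting bound 1 / (2 tau^2)
print(total <= 1, heavy <= max_heavy)
```

The count `heavy` is the number of targets for which the oracle would be forced to reveal the $b = 0$ term at tolerance $2^{-d/3}$; the counting bound $1/(2\tau^2)$ follows directly from the sum.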

Hence, at most $2^{2d/3 - 1}$ functions $c_{r,a}$ can have $|\langle f^0_g, c^0_{r,a} \rangle| \ge 1/2^{d/3}$. Since $(r, a)$ was chosen uniformly
at random, we can restate this: for any particular query $g$, the probability that $c^0_{r,a}$ has inner product of
magnitude at least $1/2^{d/3}$ with $f^0_g$ is at most $2^{2d/3 - 1}/2^{d+1} = 2^{-d/3 - 2}$. This is true regardless of $a$: since $c^0_{r,0} = -c^0_{r,1}$,
we have $|\langle f^0_g, c^0_{r,0} \rangle| = |\langle f^0_g, c^0_{r,1} \rangle|$, so the event that $|\langle f^0_g, c^0_{r,a} \rangle| \ge 1/2^{d/3}$ happens with probability at most
$2^{-d/3 - 2}$ over $r$, for $a = 0, 1$.

Recall that the learner makes $t$ queries, $g_1, \ldots, g_t$. Let Good be the event that $|\langle f^0_{g_i}, c^0_{r,a} \rangle| < 1/2^{d/3}$ for
all $i \in [t]$ (i.e., the oracle can answer each of the queries independently of $a$). Taking a union bound over
queries, we have $\Pr[\text{Good}] \ge 1 - t/2^{d/3 + 2}$ (where the probability is taken only over $r$).
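To get a feel for the union bound, it can be evaluated exactly for concrete parameters. This is a small illustrative helper of ours (not part of the proof), assuming $d$ is divisible by 3 so that $2^{d/3+2}$ is an integer.

```python
from fractions import Fraction

def good_probability_lower_bound(t, d):
    """Lower bound 1 - t / 2^(d/3 + 2) on Pr[Good]; assumes 3 divides d."""
    if d % 3 != 0:
        raise ValueError("this sketch assumes d is divisible by 3")
    return 1 - Fraction(t, 2 ** (d // 3 + 2))

# With d = 30, the denominator is 2^12 = 4096.
print(good_probability_lower_bound(1024, 30))   # 3/4
```

The bound is nontrivial only while $t < 2^{d/3+2}$, which is why a learner needs exponentially many (in $d$) nonadaptive queries before the argument stops applying.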

We argued above that there is a valid SQ oracle which, conditioned on Good, can be simulated using
$r$ but without knowledge of $a$, as long as all queries are made with tolerance $\tau \ge 1/2^{d/3}$ (as in
the theorem statement). To conclude the proof, we now argue that no nonadaptive strong learner exists
for MASKED-PARITY over the uniform distribution. For that we concentrate on the $b = 0$ half of
the inputs, where the outcome of $c_{r,a}(\cdot)$ depends on $a$. Let $h$ be the output hypothesis of the learner.
For any input $(x, i, 0)$ we have $c_{r,0}(x, i, 0) = -c_{r,1}(x, i, 0)$. Thus either $c_{r,0}(x, i, 0) \neq h(x, i, 0)$ or
$c_{r,1}(x, i, 0) \neq h(x, i, 0)$. Since the inputs with $b = 0$ have total probability $1/2$, it follows that
$\mathrm{err}(h, c_{r,0}) + \mathrm{err}(h, c_{r,1}) \ge 1/2$, and so some choice of $a$ causes the error of $h$ to be at least $1/4$.

Let $A$ be the event that $\mathrm{err}(h, c_{r,a}) \ge 1/4$. Because Good depends only on $r$, we can think of $a$ as being
selected after the learner's hypothesis $h$ whenever Good occurs. Thus, $\Pr[A \mid \text{Good}] \ge 1/2$. Using $\overline{\text{Good}}$ to
denote the complement of the event Good, we get

$$\Pr[A] = \Pr[A \wedge \text{Good}] + \Pr[A \wedge \overline{\text{Good}}] \ge \Pr[A \mid \text{Good}] \Pr[\text{Good}] + 0 \ge \frac{1}{2}\left(1 - \frac{t}{2^{d/3+2}}\right).$$

Therefore, $\Pr[\mathrm{err}(h, c_{r,a}) \ge 1/4] \ge \frac{1}{2}(1 - t/2^{d/3+2})$, as desired. □

A Concentration Bounds

We need several standard tail bounds in this paper. 

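Theorem A.1 (Multiplicative Chernoff Bounds (e.g., [18])). Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli random
variables with $\Pr[X_i = 1] = \mu$. Then for every $\phi \in (0, 1]$,

$$\Pr\left[\frac{\sum_{i} X_i}{n} \ge (1 + \phi)\mu\right] \le \exp\left(-\frac{\phi^2 \mu n}{3}\right) \quad\text{and}\quad \Pr\left[\frac{\sum_{i} X_i}{n} \le (1 - \phi)\mu\right] \le \exp\left(-\frac{\phi^2 \mu n}{2}\right).$$

Theorem A.2 (Real-valued Additive Chernoff-Hoeffding Bound [34]). Let $X_1, \ldots, X_n$ be i.i.d. random
variables with $\mathbb{E}[X_i] = \mu$ and $a \le X_i \le b$ for all $i$. Then for every $\delta > 0$,

$$\Pr\left[\left|\frac{\sum_{i} X_i}{n} - \mu\right| \ge \delta\right] \le 2 \exp\left(\frac{-2\delta^2 n}{(b - a)^2}\right).$$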
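As a sanity check on the additive Chernoff-Hoeffding bound, $\Pr[|\sum_i X_i/n - \mu| \ge \delta] \le 2\exp(-2\delta^2 n/(b-a)^2)$, one can compare it with the exact tail of a binomial. The sketch below assumes Bernoulli variables (so $a = 0$, $b = 1$); the parameter values and function names are ours, chosen only for illustration.

```python
import math

def exact_two_sided_tail(n, p, delta):
    """Exact Pr[|mean - p| >= delta] for n i.i.d. Bernoulli(p) variables."""
    total = 0.0
    for k in range(n + 1):
        # Small slack so that boundary cases are counted despite rounding.
        if abs(k / n - p) >= delta - 1e-12:
            total += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total

def hoeffding_bound(n, delta, a=0.0, b=1.0):
    """Right-hand side 2 exp(-2 delta^2 n / (b - a)^2) of the additive bound."""
    return 2 * math.exp(-2 * delta ** 2 * n / (b - a) ** 2)

n, p, delta = 50, 0.3, 0.1
exact = exact_two_sided_tail(n, p, delta)
bound = hoeffding_bound(n, delta)
print(exact <= bound)
```

The bound is distribution-free, so for any particular distribution it is typically far from tight; the comparison only confirms the direction of the inequality.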



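Lemma A.3 (Sums of Laplace Random Variables). Let $X_1, \ldots, X_n$ be i.i.d. random variables drawn from
$\mathrm{Lap}(\lambda)$ (i.e., with probability density $h(x) = \frac{1}{2\lambda} \exp\left(-\frac{|x|}{\lambda}\right)$). Then for every $\delta > 0$,

$$\Pr\left[\frac{\sum_{i=1}^{n} X_i}{n} \ge \delta\right] \le \exp\left(-\frac{\delta^2 n}{4\lambda^2}\right).$$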



The proof of this lemma is standard; we include it here since we were unable to find an appropriate 
reference. 



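Proof. Let $S = \sum_{i=1}^{n} X_i$. By the Markov inequality, for all $t > 0$,

$$\Pr[S > \delta n] = \Pr[e^{tS} > e^{t\delta n}] \le \frac{\mathbb{E}[e^{tS}]}{e^{t\delta n}} = \frac{m_S(t)}{e^{t\delta n}},$$

where $m_S(t) = \mathbb{E}[e^{tS}]$ is the moment generating function of $S$. To compute $m_S(t)$, note that the moment
generating function of $X \sim \mathrm{Lap}(\lambda)$ is $m_X(t) = \mathbb{E}[e^{tX}] = \frac{1}{1 - (\lambda t)^2}$, defined for $0 < t < 1/\lambda$. Hence
$m_S(t) = (m_X(t))^n = (1 - (\lambda t)^2)^{-n} < \exp(n(\lambda t)^2)$, where the last inequality holds for $(\lambda t)^2 < 1/2$. We
get that $\Pr[S > \delta n] \le \exp(n((\lambda t)^2 - t\delta))$. To complete the proof, set $t = \frac{\delta}{2\lambda^2}$ (note that if $\delta < 1$ and
$\lambda \ge 1$ then $(\lambda t)^2 = (\frac{\delta}{2\lambda})^2 < 1/2$). We get that

$$\Pr[S > \delta n] \le \exp\left(n\left(\left(\frac{\delta}{2\lambda}\right)^2 - \frac{\delta^2}{2\lambda^2}\right)\right) = \exp\left(-\frac{n\delta^2}{4\lambda^2}\right),$$

as desired. □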
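The closed form for the Laplace moment generating function, $m_X(t) = 1/(1 - (\lambda t)^2)$ for $|t| < 1/\lambda$, is the one ingredient of the proof that is easy to check numerically. The following is a rough sketch using a midpoint rule on a truncated interval; the truncation width, step count, and function names are our assumptions.

```python
import math

def laplace_mgf_numeric(t, lam, half_width=80.0, steps=160_000):
    """Midpoint-rule approximation of E[exp(tX)] for X ~ Lap(lam).

    Integrates exp(t x) * exp(-|x| / lam) / (2 lam) over [-half_width, half_width].
    With an even number of steps, x = 0 (where the density has a kink) falls on
    a cell boundary, which keeps the midpoint rule accurate.
    """
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += math.exp(t * x - abs(x) / lam) / (2 * lam) * h
    return total

def laplace_mgf_closed_form(t, lam):
    """Closed form 1 / (1 - (lam t)^2), valid for |t| < 1 / lam."""
    return 1.0 / (1.0 - (lam * t) ** 2)

numeric = laplace_mgf_numeric(0.5, 1.0)
closed = laplace_mgf_closed_form(0.5, 1.0)
print(abs(numeric - closed) < 1e-5)
```

The truncation is harmless here because the integrand decays like $e^{-(1/\lambda - t)|x|}$; for $t$ close to $1/\lambda$ a wider interval would be needed.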